Introduction to Open Data Science - Course Project

About the project

My git repository

date()
## [1] "Mon Dec  4 12:42:36 2023"

Heads-up: I am new to R. During this first week, I studied the first four chapters of the course book made with RStudio and watched the given tutorial videos.

I have learned how to:
- set my working directory
- use R projects, R Markdown files and GitHub commits
- install and use libraries
- identify and use the different data types, including factors
- use the pipe %>% (it means “and then…”)
- open a data set, read it in and inspect its structure
- select rows and columns, match text and rename columns
- filter data based on variable values
- create a table / tibble
- create an object that is reusable later
- work with date and time formats
- add columns with text, integers, calculations, categories etc.
- group data by one or many variables and then perform calculations
- combine all data types into a string with the function paste()
- check for NAs and remove them before doing calculations
- summarise data per variable value
- reshape tables from wide to long and long to wide
- combine two tables (functions: bind, inner_join, full_join, left_join, right_join)
- use the most common arithmetic functions and percentages
- rank and reorder values
- create plots (bar, col, line, point, box, histogram)
- play with the different style and aesthetic options
- add labels to plots
- combine plots together
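
A minimal sketch tying a few of these skills together (a factor column, the pipe, grouping and summarising), using the built-in mtcars data so it runs without any download. These are base-R equivalents of the tidyverse verbs from the course, and the object names here are my own:

```r
# transform() adds/changes columns (like mutate), aggregate() groups and
# summarises (like group_by + summarise). The `_` placeholder in the
# native pipe needs R >= 4.2.
res <- mtcars |>
  transform(cyl = factor(cyl)) |>              # make cyl a factor
  aggregate(mpg ~ cyl, data = _, FUN = mean)   # mean mpg per cyl group
res
```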

With this tool I discovered how easy and fast it can be to manipulate and visualize data from a big data set.

Conclusion of Week 1:
The book is really well made. The style is easy to follow and understand. We can learn at our own pace and try the examples in RStudio to assimilate the concepts well.
I am new to R and I find it hard to learn so many different concepts in one week. I would have liked more exercises, without the solutions written directly underneath, and with a tool to check our lines of code. Finally, I understand it is not necessary to know every function by heart, but I would like to understand the capabilities of this tool well by practicing with more real-life exercises.

Lines of code to summarize learning in week 1:

# Get the data set
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(scales)
## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor
gbd_full <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/RHDS/master/data/global_burden_disease_cause-year-sex-income.csv")
## Rows: 168 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): cause, sex, income
## dbl (2): year, deaths_millions
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Code I wrote to compute the percentage of deaths per cause for each year.
exercise4 <- gbd_full %>% 
  group_by(year, cause) %>% 
  summarise(total_per_cause= sum(deaths_millions)) %>% 
  group_by(year) %>% 
  mutate(total_per_year =sum(total_per_cause)) %>% 
  mutate(percentage = percent(total_per_cause/total_per_year)) %>% 
  select(year,cause,percentage) %>% 
  pivot_wider(names_from = cause, values_from = percentage)
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
exercise4
## # A tibble: 7 × 4
## # Groups:   year [7]
##    year `Communicable diseases` Injuries `Non-communicable diseases`
##   <dbl> <chr>                   <chr>    <chr>                      
## 1  1990 33%                     9%       58%                        
## 2  1995 31%                     9%       60%                        
## 3  2000 29%                     9%       62%                        
## 4  2005 27%                     9%       64%                        
## 5  2010 24%                     9%       67%                        
## 6  2015 20%                     8%       72%                        
## 7  2017 19%                     8%       73%
# Code I wrote to compute the number of deaths per sex and income group, and the Male/Female ratio for each income group.
gbd_full %>% 
  filter(year ==1990) %>% 
  group_by(sex,income) %>% 
  summarize(deaths_per_sex_income = sum(deaths_millions)) %>% 
  pivot_wider(names_from = sex, values_from = deaths_per_sex_income) %>% 
  mutate(diff_M_F = Male/Female)
## `summarise()` has grouped output by 'sex'. You can override using the `.groups`
## argument.
## # A tibble: 4 × 4
##   income       Female  Male diff_M_F
##   <chr>         <dbl> <dbl>    <dbl>
## 1 High           4.14  4.46     1.08
## 2 Low            2.22  2.57     1.16
## 3 Lower-Middle   8.47  9.83     1.16
## 4 Upper-Middle   6.68  7.95     1.19

Chapter 2: Regression and model validation

date()
## [1] "Mon Dec  4 12:42:39 2023"
# set directory
setwd("/Users/margot/Desktop/Desktop - MacBook Pro de MARGOT/Open data with R 2023/IODS-project")

Thoughts about this week 2:

After reading chapters 1-7, I am now more confident using RStudio. I also understand the language better and I can search the web for new functions that I did not know.

It is very exciting to see how efficient this tool is and to think about all the analyses we can do. I am an open university student and I can already see how to use this tool at work :).

Exercise: WRANGLING Please find the script file on GitHub: create_learning2014_week2

Exercise: DATA ANALYSIS

QUESTION 1:

# Using the table from the course.
csv_table_read <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/learning2014.txt", sep = ",", header = T)

# library
library("tidyverse")
library("finalfit")
library("broom")


csv_table_read
##     gender age attitude     deep  stra     surf points
## 1        F  53      3.7 3.583333 3.375 2.583333     25
## 2        M  55      3.1 2.916667 2.750 3.166667     12
## 3        F  49      2.5 3.500000 3.625 2.250000     24
## 4        M  53      3.5 3.500000 3.125 2.250000     10
## 5        M  49      3.7 3.666667 3.625 2.833333     22
## 6        F  38      3.8 4.750000 3.625 2.416667     21
## 7        M  50      3.5 3.833333 2.250 1.916667     21
## 8        F  37      2.9 3.250000 4.000 2.833333     31
## 9        M  37      3.8 4.333333 4.250 2.166667     24
## 10       F  42      2.1 4.000000 3.500 3.000000     26
## 11       M  37      3.9 3.583333 3.625 2.666667     31
## 12       F  34      3.8 3.833333 4.750 2.416667     31
## 13       F  34      2.4 4.250000 3.625 2.250000     23
## 14       F  34      3.0 3.333333 3.500 2.750000     25
## 15       M  35      2.6 4.166667 1.750 2.333333     21
## 16       F  33      4.1 3.666667 3.875 2.333333     31
## 17       F  32      2.6 4.083333 1.375 2.916667     20
## 18       F  44      2.6 3.500000 3.250 2.500000     22
## 19       M  29      1.7 4.083333 3.000 3.750000      9
## 20       F  30      2.7 4.000000 3.750 2.750000     24
## 21       M  27      3.9 3.916667 2.625 2.333333     28
## 22       M  29      3.4 4.000000 2.375 2.416667     30
## 23       F  31      2.7 4.000000 3.625 3.000000     24
## 24       F  37      2.3 3.666667 2.750 2.416667      9
## 25       F  26      3.7 3.666667 1.750 2.833333     26
## 26       F  26      4.4 4.416667 3.250 3.166667     32
## 27       M  30      4.1 3.916667 4.000 3.000000     32
## 28       F  33      3.7 3.750000 3.625 2.000000     33
## 29       F  33      2.5 3.250000 2.875 3.500000     29
## 30       M  28      3.0 3.583333 3.000 3.750000     30
## 31       M  26      3.4 4.916667 1.625 2.500000     19
## 32       F  27      3.2 3.583333 3.250 2.083333     23
## 33       F  25      2.0 2.916667 3.500 2.416667     19
## 34       F  31      2.4 3.666667 3.000 2.583333     12
## 35       M  20      4.2 4.500000 3.250 1.583333     10
## 36       F  39      1.6 4.083333 1.875 2.833333     11
## 37       M  38      3.1 3.833333 4.375 1.833333     20
## 38       M  24      3.8 3.250000 3.625 2.416667     26
## 39       M  26      3.8 2.333333 2.500 3.250000     31
## 40       M  25      3.3 3.333333 1.250 3.416667     20
## 41       F  30      1.7 4.083333 4.000 3.416667     23
## 42       F  25      2.5 2.916667 3.000 3.166667     12
## 43       M  30      3.2 3.333333 2.500 3.500000     24
## 44       F  48      3.5 3.833333 4.875 2.666667     17
## 45       F  24      3.2 3.666667 5.000 2.416667     29
## 46       F  40      4.2 4.666667 4.375 3.583333     23
## 47       M  25      3.1 3.750000 3.250 2.083333     28
## 48       F  23      3.9 3.416667 4.000 3.750000     31
## 49       F  25      1.9 4.166667 3.125 2.916667     23
## 50       F  23      2.1 2.916667 2.500 2.916667     25
## 51       M  27      2.5 4.166667 3.125 2.416667     18
## 52       M  25      3.2 3.583333 3.250 3.000000     19
## 53       M  23      3.2 2.833333 2.125 3.416667     22
## 54       F  23      2.6 4.000000 2.750 2.916667     25
## 55       F  23      2.3 2.916667 2.375 3.250000     21
## 56       F  45      3.8 3.000000 3.125 3.250000      9
## 57       F  22      2.8 4.083333 4.000 2.333333     28
## 58       F  23      3.3 2.916667 4.000 3.250000     25
## 59       M  21      4.8 3.500000 2.250 2.500000     29
## 60       M  21      4.0 4.333333 3.250 1.750000     33
## 61       F  21      4.0 4.250000 3.625 2.250000     33
## 62       F  21      4.7 3.416667 3.625 2.083333     25
## 63       F  26      2.3 3.083333 2.500 2.833333     18
## 64       F  25      3.1 4.583333 1.875 2.833333     22
## 65       F  26      2.7 3.416667 2.000 2.416667     17
## 66       M  21      4.1 3.416667 1.875 2.250000     25
## 67       F  23      3.4 3.416667 4.000 2.833333     28
## 68       F  22      2.5 3.583333 2.875 2.250000     22
## 69       F  22      2.1 1.583333 3.875 1.833333     26
## 70       F  22      1.4 3.333333 2.500 2.916667     11
## 71       F  23      1.9 4.333333 2.750 2.916667     29
## 72       M  22      3.7 4.416667 4.500 2.083333     22
## 73       M  23      3.2 4.833333 3.375 2.333333     21
## 74       M  24      2.8 3.083333 2.625 2.416667     28
## 75       F  22      4.1 3.000000 4.125 2.750000     33
## 76       F  23      2.5 4.083333 2.625 3.250000     16
## 77       M  22      2.8 4.083333 2.250 1.750000     31
## 78       M  20      3.8 3.750000 2.750 2.583333     22
## 79       M  22      3.1 3.083333 3.000 3.333333     31
## 80       M  21      3.5 4.750000 1.625 2.833333     23
## 81       F  22      3.6 4.250000 1.875 2.500000     26
## 82       F  23      2.6 4.166667 3.375 2.416667     12
## 83       M  21      4.4 4.416667 3.750 2.416667     26
## 84       M  22      4.5 3.833333 2.125 2.583333     31
## 85       M  29      3.2 3.333333 2.375 3.000000     19
## 86       F  29      3.9 3.166667 2.750 2.000000     30
## 87       F  21      2.5 3.166667 3.125 3.416667     12
## 88       M  28      3.3 3.833333 3.500 2.833333     17
## 89       F  21      3.3 4.250000 2.625 2.250000     18
## 90       F  30      3.0 3.833333 3.375 2.750000     19
## 91       F  21      2.9 3.666667 2.250 3.916667     21
## 92       M  23      3.3 3.833333 3.000 2.333333     24
## 93       F  21      3.3 3.833333 4.000 2.750000     28
## 94       F  21      3.5 3.833333 3.500 2.750000     17
## 95       F  20      3.6 3.666667 2.625 2.916667     18
## 96       M  22      3.7 4.333333 2.500 2.083333     17
## 97       M  21      4.2 3.750000 3.750 3.666667     23
## 98       M  21      3.2 4.166667 3.625 2.833333     26
## 99       F  20      5.0 4.000000 4.125 3.416667     28
## 100      M  22      4.7 4.000000 4.375 1.583333     31
## 101      F  20      3.6 4.583333 2.625 2.916667     27
## 102      F  20      3.6 3.666667 4.000 3.000000     25
## 103      M  24      2.9 3.666667 2.750 2.916667     23
## 104      F  20      3.5 3.833333 2.750 2.666667     21
## 105      F  19      4.0 2.583333 1.375 3.000000     27
## 106      F  21      3.5 3.500000 2.250 2.750000     28
## 107      F  21      3.2 3.083333 3.625 3.083333     23
## 108      F  22      2.6 4.250000 3.750 2.500000     21
## 109      F  25      2.0 3.166667 4.000 2.333333     25
## 110      F  21      2.7 3.083333 3.125 3.000000     11
## 111      F  22      3.2 4.166667 3.250 3.000000     19
## 112      F  25      3.3 2.250000 2.125 4.000000     24
## 113      F  20      3.9 3.333333 2.875 3.250000     28
## 114      M  24      3.3 3.083333 1.500 3.500000     21
## 115      F  20      3.0 2.750000 2.500 3.500000     24
## 116      M  21      3.7 3.250000 3.250 3.833333     24
## 117      F  20      2.5 4.000000 3.625 2.916667     20
## 118      F  20      2.9 3.583333 3.875 2.166667     19
## 119      M  31      3.9 4.083333 3.875 1.666667     30
## 120      F  20      3.6 4.250000 2.375 2.083333     22
## 121      F  22      2.9 3.416667 3.000 2.833333     16
## 122      F  22      2.1 3.083333 3.375 3.416667     16
## 123      M  21      3.1 3.500000 2.750 3.333333     19
## 124      M  22      4.0 3.666667 4.500 2.583333     30
## 125      F  21      3.1 4.250000 2.625 2.833333     23
## 126      F  21      2.3 4.250000 2.750 3.333333     19
## 127      F  21      2.8 3.833333 3.250 3.000000     18
## 128      F  21      3.7 4.416667 4.125 2.583333     28
## 129      F  20      2.6 3.500000 3.375 2.416667     21
## 130      F  21      2.4 3.583333 2.750 3.583333     19
## 131      F  25      3.0 3.666667 4.125 2.083333     27
## 132      M  21      2.8 2.083333 3.250 4.333333     24
## 133      F  24      2.9 4.250000 2.875 2.666667     21
## 134      F  20      2.4 3.583333 2.875 3.000000     20
## 135      M  21      3.1 4.000000 2.375 2.666667     28
## 136      F  20      1.9 3.333333 3.875 2.166667     12
## 137      F  20      2.0 3.500000 2.125 2.666667     21
## 138      F  18      3.8 3.166667 4.000 2.250000     28
## 139      F  21      3.4 3.583333 3.250 2.666667     31
## 140      F  19      3.7 3.416667 2.625 3.333333     18
## 141      F  21      2.9 4.250000 2.750 3.500000     25
## 142      F  20      2.3 3.250000 4.000 2.750000     19
## 143      M  21      4.1 4.416667 3.000 2.000000     21
## 144      F  20      2.7 3.250000 3.375 2.833333     16
## 145      F  21      3.5 3.916667 3.875 3.500000      7
## 146      F  20      3.4 3.583333 3.250 2.500000     21
## 147      F  18      3.2 4.500000 3.375 3.166667     17
## 148      M  22      3.3 3.583333 4.125 3.083333     22
## 149      F  22      3.3 3.666667 3.500 2.916667     18
## 150      M  24      3.5 2.583333 2.000 3.166667     25
## 151      F  19      3.2 4.166667 3.625 2.500000     24
## 152      F  20      3.1 3.250000 3.375 3.833333     23
## 153      F  20      2.8 4.333333 2.125 2.250000     23
## 154      F  17      1.7 3.916667 4.625 3.416667     26
## 155      M  19      1.9 2.666667 2.500 3.750000     12
## 156      F  20      3.5 3.083333 2.875 3.000000     32
## 157      F  20      2.4 3.750000 2.750 2.583333     22
## 158      F  20      2.1 4.166667 4.000 3.333333     20
## 159      F  20      2.9 4.166667 2.375 2.833333     21
## 160      F  19      1.9 3.250000 3.875 3.000000     23
## 161      F  19      2.0 4.083333 3.375 2.833333     20
## 162      F  22      4.2 2.916667 1.750 3.166667     28
## 163      M  35      4.1 3.833333 3.000 2.750000     31
## 164      F  18      3.7 3.166667 2.625 3.416667     18
## 165      F  19      3.6 3.416667 2.625 3.000000     30
## 166      M  21      1.8 4.083333 3.375 2.666667     19
# analyze the structure of the dataset
str(csv_table_read)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...
# analyze the dimension of the dataset
dim(csv_table_read)
## [1] 166   7
# Missing data? No data is missing.
ff_glimpse(csv_table_read)
## $Continuous
##             label var_type   n missing_n missing_percent mean  sd  min
## age           age    <int> 166         0             0.0 25.5 7.8 17.0
## attitude attitude    <dbl> 166         0             0.0  3.1 0.7  1.4
## deep         deep    <dbl> 166         0             0.0  3.7 0.6  1.6
## stra         stra    <dbl> 166         0             0.0  3.1 0.8  1.2
## surf         surf    <dbl> 166         0             0.0  2.8 0.5  1.6
## points     points    <int> 166         0             0.0 22.7 5.9  7.0
##          quartile_25 median quartile_75  max
## age             21.0   22.0        27.0 55.0
## attitude         2.6    3.2         3.7  5.0
## deep             3.3    3.7         4.1  4.9
## stra             2.6    3.2         3.6  5.0
## surf             2.4    2.8         3.2  4.3
## points          19.0   23.0        27.8 33.0
## 
## $Categorical
##         label var_type   n missing_n missing_percent levels_n levels
## gender gender    <chr> 166         0             0.0        2      -
##        levels_count levels_percent
## gender            -              -
# summary statistics for each variable
missing_glimpse(csv_table_read)
##             label var_type   n missing_n missing_percent
## gender     gender    <chr> 166         0             0.0
## age           age    <int> 166         0             0.0
## attitude attitude    <dbl> 166         0             0.0
## deep         deep    <dbl> 166         0             0.0
## stra         stra    <dbl> 166         0             0.0
## surf         surf    <dbl> 166         0             0.0
## points     points    <int> 166         0             0.0
# Count per gender and percentage male / female
library("scales")
csv_table_read %>% 
  count(gender) %>% 
  mutate(total_percentage = n / nrow(csv_table_read)) %>% 
  mutate(total_percentage2 = percent(total_percentage))
##   gender   n total_percentage total_percentage2
## 1      F 110        0.6626506               66%
## 2      M  56        0.3373494               34%
# Mean and median for exercises points, and learning method per gender
summary(csv_table_read)
##     gender               age           attitude          deep      
##  Length:166         Min.   :17.00   Min.   :1.400   Min.   :1.583  
##  Class :character   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333  
##  Mode  :character   Median :22.00   Median :3.200   Median :3.667  
##                     Mean   :25.51   Mean   :3.143   Mean   :3.680  
##                     3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083  
##                     Max.   :55.00   Max.   :5.000   Max.   :4.917  
##       stra            surf           points     
##  Min.   :1.250   Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.625   1st Qu.:2.417   1st Qu.:19.00  
##  Median :3.188   Median :2.833   Median :23.00  
##  Mean   :3.121   Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.625   3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :5.000   Max.   :4.333   Max.   :33.00
# The age varies from 17 to 55; the mean is 25 and the median 22. It suggests that there are some relatively high values (older participants) in the dataset
# The attitude varies from 1.4 to 5
# The points go from 7 to 33; the mean is 22 and the median is 23. It suggests that there are some relatively low values in the dataset
# we analyze the variables for both genders together and then for females and males separately
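
A tiny numeric illustration of that mean-vs-median point, with made-up ages: a few large values pull the mean up while the median stays put.

```r
# Hypothetical ages: mostly around 20, one much older participant.
ages <- c(20, 21, 22, 23, 55)
mean(ages)    # 28.2 - pulled up by the 55
median(ages)  # 22 - unaffected by the outlier
```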

# draw a scatter plot matrix of the variables in learning2014.
# [-1] excludes the first column (gender)
pairs(csv_table_read[-1])

# access the GGally and ggplot2 libraries
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(ggplot2)

# create a more advanced plot matrix with ggpairs()
p <- ggpairs(csv_table_read, mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)))
p

# some data shows that there could be correlation between some variables

# Relationship between points and attitudes
csv_table_read %>% 
  ggplot(aes(x= attitude, y= points)) +
  geom_point() +
  facet_wrap(~ gender) +
  geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'

QUESTION 2, 3 and 4:

Female model

#female model
female_data <- csv_table_read %>% 
  filter(gender == "F")

View(female_data)


# Fit a multiple linear model for females. Let's check how points are influenced by age, attitude and deep learning approach
female_fitmodel <- lm(points ~ age + attitude + deep, data = female_data)
# In this model I want to check if age, attitude and deep impact points without impacting each other. 

# summary of estimates, std errors and p values
summary(female_fitmodel) 
## 
## Call:
## lm(formula = points ~ age + attitude + deep, data = female_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.058  -3.263   0.622   4.003  10.533 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.59029    4.28982   3.168  0.00201 ** 
## age         -0.01743    0.06983  -0.250  0.80338    
## attitude     3.40151    0.70837   4.802 5.19e-06 ***
## deep        -0.27355    0.96270  -0.284  0.77685    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.353 on 106 degrees of freedom
## Multiple R-squared:  0.1791, Adjusted R-squared:  0.1559 
## F-statistic:  7.71 on 3 and 106 DF,  p-value: 0.0001043
summary(female_fitmodel)$r.squared
## [1] 0.1791183
# "age", "attitude" and "deep" together explain about 18% of the variation in "points"

# p value of the intercept: it is significant as it is very small (0.002), so the intercept plays a significant role in the regression model
# the baseline of the model is 13.59 (estimate), when no factors are taken into account.
# age is not significant and is not correlated with points
# deep is not significant and is not correlated with points
# attitude is significant and seems to play a significant role on the points.
# for a one-point increase in attitude, the points increase by 3.40 (estimate)
# The high std shows that the estimate is not so precise. It could be due to the sample size.
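
One way to make the (im)precision of those estimates concrete is a confidence interval. A small sketch with built-in data standing in for the female model, so the block runs on its own:

```r
# confint() converts the standard errors into 95% intervals: a wide
# interval means an imprecise estimate, as noted for age and deep above.
m <- lm(mpg ~ wt + hp, data = mtcars)
confint(m)   # one row per coefficient, with 2.5 % and 97.5 % bounds
```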

# I decide to drop the deep and the age variables and keep only the attitude.
female_fitmodel2 <- lm(points ~ attitude, data = female_data)
summary(female_fitmodel2) 
## 
## Call:
## lm(formula = points ~ attitude, data = female_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.0557  -3.3486   0.6137   3.9819  10.3668 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   12.194      2.156   5.655 1.29e-07 ***
## attitude       3.389      0.701   4.835 4.44e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.307 on 108 degrees of freedom
## Multiple R-squared:  0.1779, Adjusted R-squared:  0.1703 
## F-statistic: 23.38 on 1 and 108 DF,  p-value: 4.442e-06
tidy(female_fitmodel2)
## # A tibble: 2 × 5
##   term        estimate std.error statistic     p.value
##   <chr>          <dbl>     <dbl>     <dbl>       <dbl>
## 1 (Intercept)    12.2      2.16       5.66 0.000000129
## 2 attitude        3.39     0.701      4.83 0.00000444
summary(female_fitmodel2)$r.squared
## [1] 0.1779266
# The p value is very low, and so is the std, so this model supports the positive relation: a more positive attitude -> more points.
# the r-squared is still quite low..
# The model doesn't provide a good fit for the data, and a significant portion of the variance is not explained. It could be due to the sample size.

# autoplot: Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage
# Identify issues with my regression model, such as non-linearity, non-normality, or influential data points

# autoplot doesn't knit, so the calls are commented out.
#autoplot(female_fitmodel)
#autoplot(female_fitmodel2)

# I use plot function and which to get the desired plots.
plot(female_fitmodel,which = c(1,2,5))

plot(female_fitmodel2,which = c(1,2,5))

# we observe non-normality at the beginning and end of the line in the qq plot
# both models show some high-leverage points, indicated in the residuals vs leverage plot

Male model

# male model
male_data <- csv_table_read %>% 
  filter(gender == "M")

View(male_data)
summary(male_data)
##     gender               age          attitude          deep      
##  Length:56          Min.   :19.0   Min.   :1.700   Min.   :2.083  
##  Class :character   1st Qu.:21.0   1st Qu.:3.100   1st Qu.:3.396  
##  Mode  :character   Median :24.0   Median :3.400   Median :3.792  
##                     Mean   :26.8   Mean   :3.443   Mean   :3.725  
##                     3rd Qu.:29.0   3rd Qu.:3.900   3rd Qu.:4.083  
##                     Max.   :55.0   Max.   :4.800   Max.   :4.917  
##       stra            surf           points     
##  Min.   :1.250   Min.   :1.583   Min.   : 9.00  
##  1st Qu.:2.375   1st Qu.:2.312   1st Qu.:20.00  
##  Median :3.000   Median :2.625   Median :23.50  
##  Mean   :2.964   Mean   :2.704   Mean   :23.48  
##  3rd Qu.:3.531   3rd Qu.:3.167   3rd Qu.:28.25  
##  Max.   :4.500   Max.   :4.333   Max.   :33.00
# Fit a multiple linear model for males. Let's check how points are influenced by age, attitude and deep learning approach
male_fitmodel <- lm(points ~ age + attitude + deep, data = male_data)

# summary of estimates, std errors and p values
summary(male_fitmodel) 
## 
## Call:
## lm(formula = points ~ age + attitude + deep, data = male_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.8084  -3.3162  -0.0696   3.2195   9.9927 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 18.21897    6.06857   3.002 0.004112 ** 
## age         -0.16602    0.08456  -1.963 0.054974 .  
## attitude     4.31829    1.11699   3.866 0.000309 ***
## deep        -1.38378    1.22006  -1.134 0.261916    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.271 on 52 degrees of freedom
## Multiple R-squared:  0.2718, Adjusted R-squared:  0.2298 
## F-statistic:  6.47 on 3 and 52 DF,  p-value: 0.000835
tidy(male_fitmodel)
## # A tibble: 4 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   18.2      6.07        3.00 0.00411 
## 2 age           -0.166    0.0846     -1.96 0.0550  
## 3 attitude       4.32     1.12        3.87 0.000309
## 4 deep          -1.38     1.22       -1.13 0.262
summary(male_fitmodel)$r.squared 
## [1] 0.2718164
# similar results to the female model.
# age and deep have smaller p values than in the female model, while attitude has a larger one.
# the r-squared is higher as the model explains 27% of the variation, but it is still quite low. It could be due to the sample size.

# I decide to drop the deep and age variables and keep only the attitude.
male_fitmodel2 <- lm(points ~ attitude, data = male_data)
summary(male_fitmodel2) 
## 
## Call:
## lm(formula = points ~ attitude, data = male_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.6535  -2.9073  -0.5121   3.6974  10.2106 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    9.061      3.953   2.292  0.02581 *  
## attitude       4.189      1.129   3.711  0.00049 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.411 on 54 degrees of freedom
## Multiple R-squared:  0.2032, Adjusted R-squared:  0.1884 
## F-statistic: 13.77 on 1 and 54 DF,  p-value: 0.0004897
tidy(male_fitmodel2)
## # A tibble: 2 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)     9.06      3.95      2.29 0.0258  
## 2 attitude        4.19      1.13      3.71 0.000490
round(summary(male_fitmodel2)$r.squared, 4)
## [1] 0.2032
# The p value is very low, and so is the std, so this model supports the positive relation: a more positive attitude -> more points.
# the r-squared drops to about 20% in the attitude-only model and is still quite low
# The model doesn't provide a good fit for the data, and a significant portion of the variance is not explained. It could be due to the sample size.

# autoplot: Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage
# Identify issues with my regression model, such as non-linearity, non-normality, or influential data points

# autoplot doesn't knit !!
# autoplot(male_fitmodel)
# autoplot(male_fitmodel2) 

# plot with the plot function
plot(male_fitmodel,which = c(1,2,5))

plot(male_fitmodel2,which = c(1,2,5))

# The red line in residuals vs fitted stays quite close to the 0 line, which is good
# both models show non-normality, observed at the beginning of the qq plot
# both models show some high-leverage points, indicated in the residuals vs leverage plot
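
The high-leverage points flagged by the plots can also be inspected numerically. A small sketch with built-in data, as a stand-in for the male and female fits:

```r
# hatvalues() gives the leverage of each observation and
# cooks.distance() its influence; large values mark the points that
# stand out in the residuals-vs-leverage plot.
m <- lm(mpg ~ wt, data = mtcars)
lev  <- hatvalues(m)
infl <- cooks.distance(m)
head(sort(infl, decreasing = TRUE), 3)   # three most influential cars
```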

Other models tested:

test_fit1 <- csv_table_read %>% 
  lm(points ~ deep, data = .)
library(ggfortify)
summary(test_fit1)
## 
## Call:
## lm(formula = points ~ deep, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.6913  -3.6935   0.2862   4.9957  10.3537 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  23.1141     3.0908   7.478 4.31e-12 ***
## deep         -0.1080     0.8306  -0.130    0.897    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.913 on 164 degrees of freedom
## Multiple R-squared:  0.000103,   Adjusted R-squared:  -0.005994 
## F-statistic: 0.01689 on 1 and 164 DF,  p-value: 0.8967
tidy(test_fit1) # the p value for deep is large (0.897): deep alone does not predict points
## # A tibble: 2 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   23.1       3.09      7.48  4.31e-12
## 2 deep          -0.108     0.831    -0.130 8.97e- 1
summary(test_fit1)$r.squared  # too low
## [1] 0.0001029919
test_fit2 <- csv_table_read %>% 
  lm(points ~ deep * gender, data = .)
library(ggfortify)
summary(test_fit2)
## 
## Call:
## lm(formula = points ~ deep * gender, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.3247  -3.3338   0.3369   4.6242  10.6787 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  22.36387    3.91828   5.708 5.33e-08 ***
## deep         -0.01001    1.06032  -0.009    0.992    
## genderM       2.67476    6.41719   0.417    0.677    
## deep:genderM -0.40787    1.71487  -0.238    0.812    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.922 on 162 degrees of freedom
## Multiple R-squared:  0.00922,    Adjusted R-squared:  -0.009127 
## F-statistic: 0.5025 on 3 and 162 DF,  p-value: 0.6811
tidy(test_fit2) # all p values except the intercept's are large: none of the terms is significant
## # A tibble: 4 × 5
##   term         estimate std.error statistic      p.value
##   <chr>           <dbl>     <dbl>     <dbl>        <dbl>
## 1 (Intercept)   22.4         3.92   5.71    0.0000000533
## 2 deep          -0.0100      1.06  -0.00944 0.992       
## 3 genderM        2.67        6.42   0.417   0.677       
## 4 deep:genderM  -0.408       1.71  -0.238   0.812
summary(test_fit2)$r.squared # too low
## [1] 0.009220341

Basic data

# Female vs Male participants
csv_table_read %>% 
  ggplot(aes(x=gender)) +
  geom_bar()

# age chart and gender per age
csv_table_read %>% 
  ggplot(aes(x= age, fill = gender)) +
  geom_bar()

# age chart distribution per gender
csv_table_read %>% 
  ggplot(aes(x= age)) +
  facet_grid(~gender) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# age box plot distribution per gender
csv_table_read %>% 
  ggplot(aes(x= gender, y=age)) +
  geom_boxplot()

# relationship and distribution between age, points, and gender
csv_table_read %>% 
  ggplot(aes(y = points, x = age, colour = gender)) +
  geom_point() +
  labs(title = "Distribution of points per age and gender")

# the scatter plot shows a few older participants whose ages pull the mean age up relative to the median

Gender analysis

Gender and points

# Distribution of the points per gender - histogram
csv_table_read %>% 
  ggplot(aes(x = points)) +
  geom_histogram() +
  facet_grid(~gender) +
  labs(title = "Histogram of points by Gender")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#Distribution of the points per gender - boxplot
csv_table_read %>% 
  ggplot(aes(y = points, x = gender, colour = gender)) +
  geom_boxplot() +
  labs(title = "Boxplot of points by Gender")

#QQ plot - points per gender
csv_table_read %>% 
  ggplot(aes(sample = points)) +      
  geom_qq() +                          
  geom_qq_line(colour = "blue") +      
  facet_grid(~gender)

# mean points per gender - the difference is not significant (p = 0.24)
csv_table_read %>%               
  t.test(points ~ gender, data = .)
## 
##  Welch Two Sample t-test
## 
## data:  points by gender
## t = -1.1832, df = 107.84, p-value = 0.2393
## alternative hypothesis: true difference in means between group F and group M is not equal to 0
## 95 percent confidence interval:
##  -3.0896905  0.7799502
## sample estimates:
## mean in group F mean in group M 
##        22.32727        23.48214

Gender and attitude

# attitude vs gender
csv_table_read %>% 
  ggplot(aes(x=gender, y= attitude)) +
  geom_boxplot()

# Type histogram
csv_table_read %>% 
  ggplot(aes(x = attitude)) +
  geom_histogram() +
  facet_grid(~ gender) +
  labs(title = "Histogram of attitude by Gender")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# QQ plot: attitude per gender
csv_table_read %>% 
  ggplot(aes(sample = attitude)) +      
  geom_qq() +                          
  geom_qq_line(colour = "blue") +      
  facet_grid(~gender)

# mean attitude per gender - significant (p < 0.001): there is a clear difference between F and M on attitude
csv_table_read %>%               
  t.test(attitude ~ gender, data = .)
## 
##  Welch Two Sample t-test
## 
## data:  attitude by gender
## t = -4.0932, df = 122.66, p-value = 7.657e-05
## alternative hypothesis: true difference in means between group F and group M is not equal to 0
## 95 percent confidence interval:
##  -0.6718635 -0.2338508
## sample estimates:
## mean in group F mean in group M 
##        2.990000        3.442857

Gender and deep learning approach:

# deep learning approach vs gender
# the same could be done for every learning approach
# Type histogram
csv_table_read %>% 
  ggplot(aes(x = deep)) +
  geom_histogram() +
  facet_grid(~ gender) +
  labs(title = "Histogram of deep approach by Gender")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#Type boxplot
csv_table_read %>% 
  ggplot(aes(y = deep, x = gender, fill = gender)) +
  geom_boxplot() +
  labs(title = "Boxplot of deep Approach by Gender")

# QQ plot: deep per gender
csv_table_read %>% 
  ggplot(aes(sample = deep)) +      
  geom_qq() +                          
  geom_qq_line(colour = "blue") +      
  facet_grid(~gender)

# mean deep per gender - not significant (p = 0.47): no evidence of a gender difference in the deep approach
csv_table_read %>%               
  t.test(deep ~ gender, data = .)
## 
##  Welch Two Sample t-test
## 
## data:  deep by gender
## t = -0.72082, df = 101.32, p-value = 0.4727
## alternative hypothesis: true difference in means between group F and group M is not equal to 0
## 95 percent confidence interval:
##  -0.2546963  0.1189279
## sample estimates:
## mean in group F mean in group M 
##        3.656818        3.724702
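
The three Welch t-tests above all repeat the same pattern. As a hedged sketch (assuming the purrr and broom packages are available; the vector of outcome names is my own choice for illustration), they could be run in one pipeline:

```r
# Run the same Welch t-test for several outcomes at once and collect
# the results into a single tidy table
library(purrr)
library(broom)

c("points", "attitude", "deep") %>%
  set_names() %>%
  # reformulate("gender", .x) builds e.g. points ~ gender
  map(~ t.test(reformulate("gender", .x), data = csv_table_read)) %>%
  map_df(tidy, .id = "outcome")
```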

Age Analysis

Age and points

# age does not seem to affect points
csv_table_read %>% 
  ggplot(aes(x= age, y=points)) +
  geom_point()+ 
  facet_wrap(~ gender) +
  geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'

Age and attitude

# age does not seem to affect attitude
csv_table_read %>% 
  ggplot(aes(x= age, y=attitude)) +
  geom_point() +
  facet_wrap(~ gender) +
  geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'

Learning approach: deep analysis

deep and age

# deep learning approach vs age - no clear correlation
# the same could be done for every learning approach
csv_table_read %>% 
  ggplot(aes(x= age, y= deep)) +
  geom_point()  +
  facet_wrap(~ gender) +
  geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'

deep and points

# deep learning approach vs points - the fitted lines hint at a possible relationship between the deep approach and points
csv_table_read %>% 
  ggplot(aes(x= deep, y=points)) +
  geom_point() +
  facet_wrap(~ gender) +
  geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'

test_fit <- csv_table_read %>% 
  lm(points ~ deep * gender, data = .)
library(ggfortify)
summary(test_fit)
## 
## Call:
## lm(formula = points ~ deep * gender, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.3247  -3.3338   0.3369   4.6242  10.6787 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  22.36387    3.91828   5.708 5.33e-08 ***
## deep         -0.01001    1.06032  -0.009    0.992    
## genderM       2.67476    6.41719   0.417    0.677    
## deep:genderM -0.40787    1.71487  -0.238    0.812    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.922 on 162 degrees of freedom
## Multiple R-squared:  0.00922,    Adjusted R-squared:  -0.009127 
## F-statistic: 0.5025 on 3 and 162 DF,  p-value: 0.6811
# in fact, neither deep nor its interaction with gender is significant (all p-values > 0.6)
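
library(ggfortify) is attached above but never used; as a sketch of what it could add here, autoplot() on an lm object draws the standard regression diagnostic plots (the `which` values follow those of plot.lm):

```r
# Diagnostic plots for the fitted model: residuals vs fitted, normal QQ,
# scale-location, and residuals vs leverage
library(ggfortify)
autoplot(test_fit, which = c(1, 2, 3, 5))
```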

Chapter 3

setwd("/Users/margot/Desktop/Desktop - MacBook Pro de MARGOT/Open data with R 2023/IODS-project")

library(broom)
library(ggplot2)
library(dplyr)
data <- read.csv("alc.csv",sep = ",", header = T)

class(data$high_use)
## [1] "logical"
class(data$famrel)
## [1] "integer"
class(data$goout)
## [1] "integer"
class(data$studytime)
## [1] "integer"

Question 3, 4 and 5

The purpose of your analysis is to study the relationships between high/low alcohol consumption and some of the other variables in the data. To do this, choose 4 interesting variables in the data and for each of them, present your personal hypothesis about their relationships with alcohol consumption. (0-1 point)

Hypothesis 1: students with a good family relationship are less likely to have high alcohol consumption

Comments on the code chunk below: There is a negative association between family relationship quality and alcohol consumption: the worse the relationship (scored 1 to 5), the higher the risk of drinking a lot. The p-value is 0.0204, which is significant (<0.05). The coefficient for famrel in this model is -0.2838, so the odds ratio associated with a one-unit change in famrel is approximately exp(-0.2838) = 0.753: for each one-unit increase in famrel, the odds of high_use decrease by about 1 - 0.753 = 24.7%. With 95% confidence, the true odds ratio for the intercept lies between approximately 0.50 and 3.35, and for famrel between 0.59 and 0.96; since that interval lies entirely below 1, famrel has a statistically significant, negative effect on high_use. The null deviance is larger than the residual deviance, which shows that the model explains part of the variation in high_use.

# I add family = binomial because we model a binary variable (TRUE/FALSE)
logistic_model_family_alcohol <-  glm(high_use ~ famrel, data = data, family = binomial)
summary(logistic_model_family_alcohol)
## 
## Call:
## glm(formula = high_use ~ famrel, family = binomial, data = data)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)   0.2573     0.4853   0.530   0.5960  
## famrel       -0.2838     0.1224  -2.318   0.0204 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 446.67  on 368  degrees of freedom
## AIC: 450.67
## 
## Number of Fisher Scoring iterations: 4
# odds ratios: the exponential of the coefficients
coef(logistic_model_family_alcohol) %>% exp()
## (Intercept)      famrel 
##   1.2934265   0.7528865
# the same result written without the pipe
exp(coef(logistic_model_family_alcohol))
## (Intercept)      famrel 
##   1.2934265   0.7528865
# get the intervals
confint(logistic_model_family_alcohol) %>% exp()
## Waiting for profiling to be done...
##                 2.5 %    97.5 %
## (Intercept) 0.4961949 3.3547129
## famrel      0.5911258 0.9570208
# get the variance, BIC and AIC
logistic_model_family_alcohol %>% 
  glance()
## # A tibble: 1 × 8
##   null.deviance df.null logLik   AIC   BIC deviance df.residual  nobs
##           <dbl>   <int>  <dbl> <dbl> <dbl>    <dbl>       <int> <int>
## 1          452.     369  -223.  451.  458.     447.         368   370
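
As a small convenience (a sketch that only combines calls already used above), the odds ratios and their 95% confidence intervals can be shown in one table:

```r
# One table of odds ratios with their 95% CIs: exponentiate the
# coefficients and the profiled confidence intervals together
exp(cbind(OR = coef(logistic_model_family_alcohol),
          confint(logistic_model_family_alcohol)))
```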
# To check the effect of high_use on famrel instead: famrel is ordinal, so I use an ordinal (cumulative link) regression model
library(ordinal)
## 
## Attaching package: 'ordinal'
## The following object is masked from 'package:dplyr':
## 
##     slice
ordinal_model_family_alcohol <- clm(factor(famrel) ~ high_use, data = data)
summary(ordinal_model_family_alcohol)
## formula: factor(famrel) ~ high_use
## data:    data
## 
##  link  threshold nobs logLik  AIC    niter max.grad cond.H 
##  logit flexible  370  -454.70 919.40 5(0)  3.77e-08 2.0e+01
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## high_useTRUE  -0.5411     0.2139  -2.529   0.0114 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Threshold coefficients:
##     Estimate Std. Error z value
## 1|2  -4.0079     0.3672 -10.914
## 2|3  -2.7746     0.2194 -12.645
## 3|4  -1.3118     0.1422  -9.222
## 4|5   0.8468     0.1300   6.516
# This box plot supports the hypothesis: the median quality of family relationships clearly differs between the students who drink a lot and those who do not.
data %>% 
  ggplot(aes(x= factor(high_use), y= famrel)) +
  geom_boxplot() +
  labs(title = "Quality of family relationship vs the alcohol consumption",
       x = "High consumption",
       y = "quality of family relationship")

data %>%
  ggplot(aes(x = as.factor(famrel), fill = as.factor(high_use))) +
  geom_bar(position = "fill") +
  labs(
    title = "Relationship Between Family relationship and Alcohol Consumption",
    x = "Family relation score from 1 to 5",
    y = "Proportion of students",
    fill = "High Use of alcohol: True or False"
  )

# there are fewer scores of 5 and far more scores of 2 among the students who drink a lot
data %>%
  ggplot(aes(x = as.factor(high_use), fill = as.factor(famrel))) +
  geom_bar(position = "fill") +
  labs(
    title = "Family relation score distribution by Alcohol Consumption",
    x = "High Use of alcohol: True or False",
    y = "Proportion of students",
    fill = "Family relation score from 1 to 5"
  )

Hypothesis 2: students who go out with their friends more often are more likely to have high alcohol consumption

Comments on the code chunk below: There is a positive relationship between going out and alcohol consumption, with a very small (significant) p-value, which confirms the hypothesis: the more students go out, the higher their risk of high alcohol consumption.

The coefficient for goout is 0.7563. The odds ratio associated with a one-unit change in goout is approximately exp(0.7563) = 2.13: for each one-unit increase in goout, the odds of high_use roughly double (an increase of about 113%).

With 95% confidence, the true odds ratio for the intercept lies between approximately 0.015 and 0.078, and for goout between 1.71 and 2.70; since that interval lies entirely above 1, goout has a statistically significant, positive effect on high_use.

The null deviance is much larger than the residual deviance, so the model explains part of the variation in high_use, even better than in hypothesis 1. The AIC of this model is also smaller than that of the previous model, which supports this conclusion.

logistic_model_goout_alcohol <- glm(high_use ~ goout, data = data, family = binomial)
summary(logistic_model_goout_alcohol)
## 
## Call:
## glm(formula = high_use ~ goout, family = binomial, data = data)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -3.3368     0.4188  -7.968 1.62e-15 ***
## goout         0.7563     0.1165   6.494 8.34e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 403.05  on 368  degrees of freedom
## AIC: 407.05
## 
## Number of Fisher Scoring iterations: 4
tidy(logistic_model_goout_alcohol)
## # A tibble: 2 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   -3.34      0.419     -7.97 1.62e-15
## 2 goout          0.756     0.116      6.49 8.34e-11
# odds ratios: the exponential of the coefficients
coef(logistic_model_goout_alcohol) %>% exp()
## (Intercept)       goout 
##  0.03554946  2.13048020
# the same result written without the pipe
exp(coef(logistic_model_goout_alcohol))
## (Intercept)       goout 
##  0.03554946  2.13048020
# get the intervals
confint(logistic_model_goout_alcohol) %>% exp()
## Waiting for profiling to be done...
##                  2.5 %     97.5 %
## (Intercept) 0.01515176 0.07850224
## goout       1.70566958 2.69518608
# plots
data %>% 
  ggplot(aes(x= factor(high_use), y= goout)) +
  geom_boxplot() +
  labs(title = "Going out with friends vs the alcohol consumption",
       x = "High consumption",
       y = "go out with friends (from 1 to 5)")

# high-use students make up a larger share of those who scored goout 4 or 5
data %>%
  ggplot(aes(x = factor(goout), fill = factor(high_use))) +
  geom_bar(position = "fill") +
  labs(
    title = "Relationship Between Going Out Behavior and Alcohol Consumption",
    x = "going out with friends from 1 to 5",
    y = "Proportion of students",
    fill = "High Use of alcohol: True or False"
  )

# scores 1, 2 and 3 are more common among the students who do not drink a lot
data %>%
  ggplot(aes(x = factor(high_use), fill = factor(goout))) +
  geom_bar(position = "fill") +
  labs(
    title = "Relationship Between Alcohol Consumption and going out behavior",
    x = "High Use of alcohol: True or False",
    y = "Proportion of students",
    fill = "going out with friends from 1 to 5"
  )

Hypothesis 3: students who study a lot are less likely to have high alcohol consumption

Comments about the code chunk below: The p-value is very small and the estimate is negative, so there is a negative association between study time and alcohol consumption: as study time increases, the odds of high alcohol consumption decrease.

The coefficient for studytime is -0.6208. The odds ratio associated with a one-unit change in studytime is approximately exp(-0.6208) = 0.538: for each one-unit increase in studytime, the odds of high_use decrease by about 1 - 0.538 = 46.2%.

With 95% confidence, the true odds ratio for the intercept lies between approximately 0.79 and 2.67, and for studytime between 0.39 and 0.72; since that interval lies entirely below 1, studytime has a statistically significant, negative effect on high_use.

The residual deviance is smaller than the null deviance, which means the model explains part of the variation in high_use. The AIC is smaller than in the first model but not as small as in the second.

logistic_model_studytime_alcohol <- glm(high_use ~ studytime, data = data, family = binomial)
summary(logistic_model_studytime_alcohol)
## 
## Call:
## glm(formula = high_use ~ studytime, family = binomial, data = data)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.3649     0.3102   1.176    0.239    
## studytime    -0.6208     0.1542  -4.027 5.66e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 433.82  on 368  degrees of freedom
## AIC: 437.82
## 
## Number of Fisher Scoring iterations: 4
tidy(logistic_model_studytime_alcohol)
## # A tibble: 2 × 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)    0.365     0.310      1.18 0.239    
## 2 studytime     -0.621     0.154     -4.03 0.0000566
# odds ratios: the exponential of the coefficients
coef(logistic_model_studytime_alcohol) %>% exp()
## (Intercept)   studytime 
##   1.4403849   0.5375316
# the same result written without the pipe
exp(coef(logistic_model_studytime_alcohol))
## (Intercept)   studytime 
##   1.4403849   0.5375316
# get the intervals
confint(logistic_model_studytime_alcohol) %>% exp()
## Waiting for profiling to be done...
##                 2.5 %    97.5 %
## (Intercept) 0.7881099 2.6654451
## studytime   0.3933727 0.7208174
# This plot shows the clear difference in study time between the students who drink a lot and those who do not
data %>% 
  ggplot(aes(x= factor(high_use), y= studytime)) +
  geom_boxplot() +
  labs(title = "Study time vs the alcohol consumption",
       x = "High consumption",
       y = "Study time (from 1 to 4)")

# high-use students are clearly overrepresented among those who report a study time of 1 or 2
data %>%
  ggplot(aes(x = factor(studytime), fill = factor(high_use))) +
  geom_bar(position = "fill") +
  labs(
    title = "Relationship Between Study time Behavior and Alcohol Consumption",
    x = "Study time from 1 to 4",
    y = "Proportion of students",
    fill = "High Use of alcohol: True or False"
  )

# far more scores of 1 and far fewer scores of 3 and 4 among students who drink a lot
data %>%
  ggplot(aes(x = factor(high_use), fill = factor(studytime))) +
  geom_bar(position = "fill") +
  labs(
    title = "Relationship Between Alcohol Consumption and Study time behavior",
    x = "High Use of alcohol: True or False",
    y = "Proportion of students",
    fill = "Study time from 1 to 4"
  )

Hypothesis 4: students who chose their school for its reputation are less likely to have high alcohol consumption

Comments about the code below: The p-value is 0.023, which indicates an association between the reason for choosing a school and alcohol consumption.

The test alone does not tell us which reason is chosen most often by the students who drink a lot versus those who do not; the plots below help identify that.

# Creating a contingency table
chi_table_alcohol_reasons <- table((data$high_use), as.factor(data$reason))

# Performing the chi-squared test
chi_square_test <- chisq.test(chi_table_alcohol_reasons )
chi_square_test
## 
##  Pearson's Chi-squared test
## 
## data:  chi_table_alcohol_reasons
## X-squared = 9.563, df = 3, p-value = 0.02267
# Summary of the test
summary(chi_square_test)
##           Length Class  Mode     
## statistic 1      -none- numeric  
## parameter 1      -none- numeric  
## p.value   1      -none- numeric  
## method    1      -none- character
## data.name 1      -none- character
## observed  8      table  numeric  
## expected  8      -none- numeric  
## residuals 8      table  numeric  
## stdres    8      table  numeric
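As a sketch of how to see which reasons actually drive the association (using only the test object created above): the standardized residuals stored by chisq.test() flag the cells that deviate most from independence.

```r
# Standardized residuals of the chi-squared test; cells beyond
# roughly +/-2 contribute most to the association
chi_square_test$stdres
```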
# This shows that students who don't drink a lot chose their school for its reputation more often than those who drink a lot. Among the drinkers, the reasons "home" and "other" are clearly more common, while the reason "course" seems equally important for both groups.
data %>%
  ggplot(aes(x = high_use, fill = reason)) +
  geom_bar(position = "fill") +
  labs(
    title = "Relationship Between Reason to choose their school and Alcohol Consumption",
    x = "High Use of alcohol: True or False",
    y = "Proportion of students",
    fill = "Reason to choose the school"
  )

data %>%
  ggplot(aes(x = reason, fill = high_use)) +
  geom_bar(position = "fill") +
  labs(
    title = "Relationship Between Alcohol Consumption and Reason to choose their school",
    x = "Reason to choose the school",
    y = "Proportion of students",
    fill = "High Use of alcohol: True or False"
  )

Other models to test:

By combining the variables in one model, we can see that the variation in high_use is better explained (the residual deviance decreases from 452 to 379).

All three variables (goout, famrel, and studytime) are statistically significant predictors of alcohol consumption (high_use).

logistic_model_3var <- glm(high_use ~ goout + famrel + studytime, data = data, family = binomial)
logistic_model_3var_2 <- glm(high_use ~ goout * famrel * studytime, data = data, family = binomial)
summary(logistic_model_3var)
## 
## Call:
## glm(formula = high_use ~ goout + famrel + studytime, family = binomial, 
##     data = data)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.7286     0.6941  -1.050  0.29381    
## goout         0.7777     0.1203   6.467 9.99e-11 ***
## famrel       -0.3777     0.1367  -2.763  0.00573 ** 
## studytime    -0.6139     0.1672  -3.673  0.00024 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 379.74  on 366  degrees of freedom
## AIC: 387.74
## 
## Number of Fisher Scoring iterations: 4
tidy(logistic_model_3var)
## # A tibble: 4 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   -0.729     0.694     -1.05 2.94e- 1
## 2 goout          0.778     0.120      6.47 9.99e-11
## 3 famrel        -0.378     0.137     -2.76 5.73e- 3
## 4 studytime     -0.614     0.167     -3.67 2.40e- 4
# Anova conclusion: the p-value for the difference in deviance (Pr(>Chi)) is 0.1362, so the interaction model does not fit significantly better and the simpler additive model is preferred
anova(logistic_model_3var, logistic_model_3var_2, test="LRT")
## Analysis of Deviance Table
## 
## Model 1: high_use ~ goout + famrel + studytime
## Model 2: high_use ~ goout * famrel * studytime
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1       366     379.74                     
## 2       362     372.74  4    6.995   0.1362
# compared with, for instance, logistic_model_family_alcohol, logistic_model_3var has a lower residual deviance and the difference is significant
anova(logistic_model_3var, logistic_model_family_alcohol, test="LRT")
## Analysis of Deviance Table
## 
## Model 1: high_use ~ goout + famrel + studytime
## Model 2: high_use ~ famrel
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1       366     379.74                          
## 2       368     446.67 -2  -66.934 2.921e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# logistic_model_3var is the best model according to both AIC and BIC, since both metrics are smallest for it
AIC(logistic_model_3var, logistic_model_3var_2)
##                       df      AIC
## logistic_model_3var    4 387.7372
## logistic_model_3var_2  8 388.7422
BIC(logistic_model_3var, logistic_model_3var_2)
##                       df      BIC
## logistic_model_3var    4 403.3912
## logistic_model_3var_2  8 420.0502

Question 6: Model 1 (high_use ~ goout + famrel + studytime)

Using the variables which, according to your logistic regression model, had a statistical relationship with high/low alcohol consumption, explore the predictive power of you model. Provide a 2x2 cross tabulation of predictions versus the actual values and optionally display a graphic visualizing both the actual values and the predictions. Compute the total proportion of inaccurately classified individuals (= the training error) and comment on all the results. Compare the performance of the model with performance achieved by some simple guessing strategy.

# prediction based on the model: logistic_model_3var
predictions <-  predict(logistic_model_3var, type = "response")

#Create the confusion matrix: 
table(data$high_use, predictions > 0.5)
##        
##         FALSE TRUE
##   FALSE   235   24
##   TRUE     63   48
# correct: FALSE/FALSE + TRUE/TRUE = 235 + 48 = 283
# wrong: FALSE/TRUE + TRUE/FALSE = 24 + 63 = 87

# proportion of misclassifications = 87/(283+87) = 87/370 = 23.5%
# odds of misclassification = 87/283 = 31%

# precision: correctly predicted positives among all positive predictions = 48/72 = 67%
precision <- 48 / (48 + 24)
# recall: correctly predicted positives among all actual positives = 48/111 = 43%
recall <- 48 / (48 + 63)
# F1 score: about 0.52
 2 * (precision * recall) / (precision + recall)
## [1] 0.5245902
 # Accuracy of always predicting the majority class:  if the majority class is "FALSE," then predicting "FALSE" for every instance would result in a correct prediction rate of 70%
mean(data$high_use == FALSE)
## [1] 0.7
# The total proportion of errors (the training error) is 23.5%, so my model beats the simple guessing strategy of always predicting the majority class, whose error rate is 30%.
mean(data$high_use != (predictions > 0.5))
## [1] 0.2351351
# to conclude, my model is a bit better than using the prediction based on the majority class. But it should be improved for better accuracy!
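
A hedged sketch of how the same metrics could be derived directly from the confusion matrix, so that none of the counts needs to be hard-coded (rows are the actual values, columns the predictions; the variable names are my own):

```r
# Build the confusion matrix with named dimensions
conf_mat <- table(actual = data$high_use, predicted = predictions > 0.5)

# Accuracy: correctly classified cells on the diagonal
accuracy  <- sum(diag(conf_mat)) / sum(conf_mat)
train_err <- 1 - accuracy

# Precision and recall for the positive (TRUE) class
precision <- conf_mat["TRUE", "TRUE"] / sum(conf_mat[, "TRUE"])
recall    <- conf_mat["TRUE", "TRUE"] / sum(conf_mat["TRUE", ])

c(accuracy = accuracy, training_error = train_err,
  precision = precision, recall = recall)
```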

Bonus:

I had to rely on ChatGPT for this one, as I did not know the method. I now understand that it splits the data into 10 folds and uses 9 of them to fit the model and the remaining one to test it, rotating until every fold has been tested. Here, the average accuracy over the folds is 74.3%, which is better than the 70% of the majority-class strategy.

library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
library(lattice)
# Define the model formula
formula <- factor(high_use) ~ goout + famrel + studytime

# Define the number of folds for cross-validation
num_folds <- 10

# Create a training control with 10-fold cross-validation
train_control <- trainControl(method = "cv", number = num_folds)

# Train the model using 10-fold cross-validation
model <- train(formula, data = data, method = "glm", family = binomial, trControl = train_control)

# Get cross-validated results
results <- model$resample
str(results)
## 'data.frame':    10 obs. of  3 variables:
##  $ Accuracy: num  0.811 0.784 0.757 0.73 0.73 ...
##  $ Kappa   : num  0.479 0.345 0.287 0.232 0.317 ...
##  $ Resample: chr  "Fold01" "Fold02" "Fold03" "Fold04" ...
# Calculate the mean cross-validated accuracy
mean(results$Accuracy)
## [1] 0.7430575
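
An alternative sketch, assuming the boot package is available: cv.glm() runs the same 10-fold cross-validation on the already fitted model and, with a custom cost function, reports the misclassification rate directly instead of the accuracy.

```r
library(boot)

# Cost function: share of wrong classifications at the 0.5 cutoff
cost <- function(actual, pred) mean(actual != (pred > 0.5))

# 10-fold cross-validation on the fitted logistic model
cv <- cv.glm(data = data, cost = cost,
             glmfit = logistic_model_3var, K = 10)
cv$delta[1]  # cross-validated misclassification rate
```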

Chapter 4

Question 1, 2 and 3

dataset description:

The Boston dataset seems to explore the relationships between standard of living, housing quality, and other criteria such as the urban environment, accessibility, and demographics within the Boston area.

Each row represents data for a specific town and has information about various factors such as crime rate, proportion of non-retail business acres, property tax rates, pupil-teacher ratios, etc.

This data frame contains the following columns:

  • crim - per capita crime rate by town.

  • zn - proportion of residential land zoned for lots over 25,000 sq.ft.

  • indus - proportion of non-retail business acres per town.

  • chas - Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).

  • nox - nitrogen oxides concentration (parts per 10 million).

  • rm - average number of rooms per dwelling.

  • age - proportion of owner-occupied units built prior to 1940.

  • dis - weighted mean of distances to five Boston employment centres.

  • rad - index of accessibility to radial highways.

  • tax - full-value property-tax rate per $10,000.

  • ptratio - pupil-teacher ratio by town.

  • black - 1000(Bk − 0.63)^2, where Bk is the proportion of blacks by town.

  • lstat - lower status of the population (percent).

  • medv - median value of owner-occupied homes in $1000s.

Data analysis:

Explore the structure and the dimensions of the dataset. Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, the distributions of the variables and the relationships between them.

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(dplyr)
library(ggplot2)
library(GGally)
library(tidyr)

setwd("/Users/margot/Desktop/Desktop - MacBook Pro de MARGOT/Open data with R 2023/IODS-project")

Boston <- Boston
# explore the mean, median, min, max per variable
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
# 506 entries with 14 variables
dim(Boston)
## [1] 506  14
# Data type: all num or int
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
# let's create a scatterplot for each pair of variables -> not easy to read with so many variables
pairs(Boston)

# Let's analyze the histogram of each variable. To use facet_wrap(), which draws one panel per key, we need a table with 2 columns: the key (the variable name) and the value (the measurement).
gather(Boston) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + 
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Let's compute a correlation matrix
matrix_correlation_var <- cor(Boston)

# Visualize correlation matrix as a heatmap
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
ggplot(data = melt(matrix_correlation_var), aes(Var1, Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0)
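As a cross-check of the relationships listed next, the strongest correlations can also be extracted programmatically rather than read off the heatmap. A minimal sketch (my own addition; the 0.7 cutoff is an arbitrary choice):

```r
# List the strongest correlation pairs programmatically.
library(MASS)      # Boston data
library(reshape2)  # melt()
library(dplyr)

cor_long <- melt(cor(Boston))                        # long format: Var1, Var2, value
strong_pairs <- cor_long %>%
  filter(as.character(Var1) < as.character(Var2),    # keep each pair once, drop the diagonal
         abs(value) > 0.7) %>%                       # arbitrary threshold
  arrange(desc(abs(value)))
strong_pairs
```

Sorting by absolute value puts the most strongly related pairs at the top, which makes them easier to compare with the heatmap.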

According to the correlation matrix, there are negative relationships between certain variables:

lstat and medv - lower status of population and the median price of homes owned by occupants

lstat and rm - lower status of population and the average number of rooms per home

tax and medv - the percentage at which a property is taxed based on its assessed value and the median price of homes owned by occupants

dis and lstat - the weighted average distance from each house to these employment centers and the lower status of population

dis and age - the weighted average distance from each house to these employment centers and the percentage or fraction of homes that are owner-occupied and were constructed before the year 1940.

dis and nox - the weighted average distance from each house to these employment centers and nitrogen oxides concentration.

dis and indus - the weighted average distance from each house to these employment centers and the proportion of non-retail business acres per town (manufacturing facilities, industrial parks, office spaces, warehouses etc).

tax and dis - the percentage at which a property is taxed based on its assessed value and the weighted average distance from each house to these employment centers

zn and age - the percentage or fraction of land within residential areas that is for individual lots with a minimum size of over 25,000 square feet and the percentage or fraction of homes that are owner-occupied and were constructed before the year 1940

zn and nox - the percentage or fraction of land within residential areas that is for individual lots with a minimum size of over 25,000 square feet and the nitrogen oxides concentration.

zn and indus - the percentage or fraction of land within residential areas that is for individual lots with a minimum size of over 25,000 square feet and the proportion of non-retail business acres per town (manufacturing facilities, industrial parks, office spaces, warehouses etc).

Positive relationships between variables:

medv and rm - the median price of homes owned by occupants and the average number of rooms per home

tax and indus - the percentage at which a property is taxed based on its assessed value and the proportion of non-retail business acres per town.

tax and nox - the percentage at which a property is taxed based on its assessed value and the nitrogen oxides concentration.

age and indus - the percentage or fraction of homes that are owner-occupied and were constructed before the year 1940 and the proportion of non-retail business acres per town.

age and nox- the percentage or fraction of homes that are owner-occupied and were constructed before the year 1940 and the nitrogen oxides concentration.

nox and indus - the nitrogen oxides concentration and the proportion of non-retail business acres per town.

-> variables to test in my models: lstat, medv, rm, tax, dis, age, nox, indus, zn

Test some models based on the dataset and observed correlation

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
model1 <-  glm(lstat ~ medv + dis + rm,data=Boston)
model2 <-  glm(medv ~ rm + tax + lstat ,data=Boston)

# All the VIFs are between 1 and 2.1, which is fine. This suggests low multicollinearity and implies that the variance of the estimated coefficients is not significantly inflated by collinearity.
vif(model1)
##     medv      dis       rm 
## 1.982257 1.068810 1.940168
vif(model2)
##       rm      tax    lstat 
## 1.610953 1.426005 2.092901
# let's exponentiate the coefficients and compute the confidence intervals (95%). Note: since these are Gaussian models rather than logistic ones, the exponentiated coefficients are not true odds ratios:
OR1 <- coef(model1) %>% exp
OR2 <- coef(model2) %>% exp
CI1 <- confint(model1) %>% exp
## Waiting for profiling to be done...
CI2 <- confint(model2) %>% exp
## Waiting for profiling to be done...
# an exponentiated confidence interval that doesn't span 1 (equivalently, a coefficient CI that doesn't span 0) indicates a statistically significant effect; this holds for both models
cbind(OR1, CI1) 
##                      OR1        2.5 %       97.5 %
## (Intercept) 1.764803e+16 4.023272e+14 7.741285e+17
## medv        6.606608e-01 6.250484e-01 6.983022e-01
## dis         3.292580e-01 2.756488e-01 3.932934e-01
## rm          1.682751e-01 8.210665e-02 3.448749e-01
cbind(OR2, CI2)
##                     OR2        2.5 %      97.5 %
## (Intercept)   0.6073487  0.001289619 286.0319984
## rm          181.1868455 76.546116788 428.8744403
## tax           0.9935198  0.990167804   0.9968832
## lstat         0.5754723  0.522466602   0.6338557
# The residual deviance is much smaller than the null deviance, which indicates a reasonably good fit of the model to the data.
summary(model1)
## 
## Call:
## glm(formula = lstat ~ medv + dis + rm, data = Boston)
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 37.40940    1.92918  19.391  < 2e-16 ***
## medv        -0.41451    0.02827 -14.662  < 2e-16 ***
## dis         -1.11091    0.09067 -12.252  < 2e-16 ***
## rm          -1.78215    0.36612  -4.868 1.51e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 17.22406)
## 
##     Null deviance: 25752.4  on 505  degrees of freedom
## Residual deviance:  8646.5  on 502  degrees of freedom
## AIC: 2882.2
## 
## Number of Fisher Scoring iterations: 2
# medv and rm also influence each other, so let's modify the model by adding an interaction term
model11 <-  glm(lstat ~ medv + dis + rm + rm * medv ,data=Boston)
summary(model11)
## 
## Call:
## glm(formula = lstat ~ medv + dis + rm + rm * medv, data = Boston)
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 74.41052    3.64466   20.42   <2e-16 ***
## medv        -1.94316    0.13517  -14.38   <2e-16 ***
## dis         -0.89569    0.08285  -10.81   <2e-16 ***
## rm          -7.55541    0.59817  -12.63   <2e-16 ***
## medv:rm      0.22526    0.01957   11.51   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 13.64918)
## 
##     Null deviance: 25752.4  on 505  degrees of freedom
## Residual deviance:  6838.2  on 501  degrees of freedom
## AIC: 2765.5
## 
## Number of Fisher Scoring iterations: 2
# Same for this one: the residual deviance is much smaller than the null deviance.
summary(model2)
## 
## Call:
## glm(formula = medv ~ rm + tax + lstat, data = Boston)
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.498652   3.140239  -0.159 0.873895    
## rm           5.199529   0.439618  11.827  < 2e-16 ***
## tax         -0.006501   0.001724  -3.770 0.000182 ***
## lstat       -0.552564   0.049302 -11.208  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 29.90866)
## 
##     Null deviance: 42716  on 505  degrees of freedom
## Residual deviance: 15014  on 502  degrees of freedom
## AIC: 3161.4
## 
## Number of Fisher Scoring iterations: 2

Question 4

Standardizing the data. Variables often have different scales and units, which makes direct comparisons difficult. Standardization brings all variables to a common scale, with a mean of 0 and a standard deviation of 1. This helps in interpreting coefficients and in comparing the relative importance of different predictors in regression. It also ensures that each variable contributes equally to distance-based techniques, preventing one variable from dominating the analysis because of its scale. Finally, it can mitigate issues related to multicollinearity in regression by putting variables on a similar scale, reducing the impact of differing scales on the regression coefficients.

# select numerical values
Boston_numeric_cols <- Boston[, sapply(Boston, is.numeric)]

# Use scale() to standardize the data: each variable is transformed to have a mean of 0 and a standard deviation of 1.
scaled_boston <- scale(Boston_numeric_cols)

# convert to a data frame
scaled_table_boston <- as.data.frame(scaled_boston)

# how did the data change? Every mean is now 0, so the scaling worked
summary(scaled_table_boston)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
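As a quick sanity check (my own addition), scale() performs exactly the (x - mean) / sd transformation described above; this can be verified on a single column:

```r
library(MASS)  # Boston data

# Standardize one column by hand and compare with scale().
manual    <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
via_scale <- scale(Boston$crim)[, 1]  # scale() returns a matrix; take the column

all.equal(manual, via_scale, check.attributes = FALSE)  # TRUE
```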
# Use the cut() function to create a categorical variable based on intervals (breaks) in a numerical variable. We cut the crim variable at its quantiles (0, 0.25, 0.50, 0.75, 1) and add a label for each category.
# include.lowest = TRUE ensures the lowest value is not dropped into an NA category.
quantiles <- quantile(Boston$crim, probs = seq(0, 1, by = 0.25), na.rm = TRUE)
interval_labels <- c("premier_quantil", "second_quantil", "third_quantil", "fourth_quantil")  

scaled_table_boston$quantiles_crime <- cut(Boston$crim, quantiles, labels= interval_labels,include.lowest = TRUE)
# Notes: quantiles derived from numeric values can be treated as categorical or continuous. When quantiles define discrete bins that divide a continuous variable into distinct groups, they are categorical (e.g., small, medium, big). When they represent numeric positions relative to the distribution of a continuous variable (e.g., the 25th percentile, the median, the 75th percentile), they are continuous.

# drop the former column crim and create a new table. 
Boston_new <- scaled_table_boston %>% dplyr::select(-crim)

# We need 80% of the total rows for the training set
train_size <- round(0.8 * nrow(Boston_new))

# Randomly sample 80% of the row indices for the training set
train_set <- sample(seq_len(nrow(Boston_new)), size = train_size)

# Create training and testing subsets
train_data <- Boston_new[train_set, ]
test_data <- Boston_new[-train_set, ]
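One caveat (my own note): sample() draws a different split on every run, so the results further down will vary between knits. A reproducible variant of the same split, reusing Boston_new from above and assuming an arbitrary seed of 123:

```r
# Fixing the RNG seed makes sample() return the same indices on every run.
set.seed(123)  # 123 is an arbitrary choice

n          <- nrow(Boston_new)
train_idx  <- sample(seq_len(n), size = round(0.8 * n))
train_data <- Boston_new[train_idx, ]   # 80% of the rows
test_data  <- Boston_new[-train_idx, ]  # the remaining 20%
```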

Question 5:

Fit the linear discriminant analysis on the train set. Use the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables. Draw the LDA (bi)plot.

Linear Discriminant Analysis (LDA) can be quite useful in various real-life scenarios where classification or dimensionality reduction is required such as: Marketing and Customer Segmentation, product quality control, credit scoring.

It is ideal when we have multiple variables that can help us categorize each entry in one specific group where all others will have similar mean, and variance. The goal is to predict the categorical outcome or class based on a set of predictors.

LDA transforms the original variables into a smaller set of new variables, known as linear discriminants. These new variables are constructed to maximize the separation between the different categories or classes in the data. By using the information captured in the linear discriminants, LDA can accurately assign new observations to their respective categories. In short, LDA combines the variables in a way that highlights the differences between groups, and then uses this understanding to classify new data based on what it has learned. This makes it useful both for simplifying data and for making predictions about categories in that data.

LDA calculates the mean and variance for each predictor within each class or category in the target variable. It finds coefficients for these combinations by considering differences in means between classes and assumes equal variances for each predictor within classes. It assumes that the data approximately follows a normal distribution.

# check the impact of each variable on a categorical variable (quantiles_crime)
# each quantile contains approximately 25% of the observations
# LD1 explains 96% of the between-group variance. This means LD1 alone separates the classes effectively based on their mean differences: along LD1, observations are projected so that the separation between classes is maximized while the variation within each class is minimized.
# Positive coefficients indicate that higher values of that variable contribute to a higher score in that particular discriminant, while negative coefficients suggest the opposite. The larger the magnitude of the coefficient, the greater the impact of that variable on the discriminant.

library(MASS)

# Fit LDA on the train set - LDA is a statistical method used for classification. It fits an LDA model using the train_data, where quantiles_crime is the target variable to predict based on the other variables
lda_model <- lda(quantiles_crime ~ ., data = train_data)

# Predict on the test set by using the function predict to apply the model created from the train_data to test_data -> the idea here is to predict the class labels for this test_data
lda_pred <- predict(lda_model, test_data)$class # class extract the predicted class labels

# actual crime quantiles
actual_crime_categories <- test_data$quantiles_crime

# Extract LD scores from the LDA model's predictions
lda_scores <- predict(lda_model, test_data)$x # $x contains the matrix of discriminant scores (LD1, LD2, ...) for the observations in test_data; the posterior probabilities would be in $posterior

# Create a dataframe with LD scores from first 2 columns of the lda_score and the predicted classes. Combining LD1, LD2, and the predicted classes for visualization. In many cases, visualizing beyond LD1 and LD2 might become complex to display effectively in two-dimensional plots. LD1 and LD2 are typically chosen for visualization as they capture the most discrimination power while allowing for a clearer visualization on a 2D plot.
plot_data <- data.frame(LD1 = lda_scores[, 1], LD2 = lda_scores[, 2], prediction_crime_quantile = as.factor(lda_pred))

plot_data
##            LD1          LD2 prediction_crime_quantile
## 11  -1.7420309 -0.563988780            second_quantil
## 18  -1.8844395 -0.697588155            second_quantil
## 32  -1.7989232 -0.641298376             third_quantil
## 41  -3.3024128  2.332770951           premier_quantil
## 51  -2.6872906  0.702647342            second_quantil
## 63  -0.9375197  1.032234362            second_quantil
## 64  -1.1380939  0.895635746            second_quantil
## 68  -2.9480894  1.000450512            second_quantil
## 78  -2.1718156  0.367482743            second_quantil
## 79  -2.1020789  0.095429877            second_quantil
## 81  -2.8339574  1.043613561           premier_quantil
## 82  -2.4943084  0.852991483           premier_quantil
## 83  -2.7888610  1.180650907           premier_quantil
## 93  -2.4293889  0.768682385           premier_quantil
## 94  -2.6298188  0.975433516           premier_quantil
## 99  -3.5830580 -0.579652967            second_quantil
## 100 -3.3924426 -0.366355720            second_quantil
## 118 -1.2506981 -0.354287099            second_quantil
## 120 -1.2747237 -0.325676369            second_quantil
## 126 -2.3470023 -1.640284168             third_quantil
## 132 -1.4990246 -1.629460840             third_quantil
## 134 -1.3807754 -1.650234034             third_quantil
## 135 -1.1294140 -1.622673961             third_quantil
## 140 -1.3196570 -1.723866077             third_quantil
## 144 -0.5080338 -3.491441580             third_quantil
## 146 -0.2705193 -3.518305450             third_quantil
## 148 -0.3877692 -3.500527255             third_quantil
## 158 -1.3539425 -2.034963070             third_quantil
## 164 -2.0175909 -2.767803439             third_quantil
## 170 -1.3953191 -1.375795687             third_quantil
## 175 -1.8608627 -0.031188923            second_quantil
## 176 -2.3184829  0.030690369            second_quantil
## 177 -2.1065197  0.007678271            second_quantil
## 182 -2.5298343 -0.643970906            second_quantil
## 184 -2.4589965 -0.653150709            second_quantil
## 187 -2.8229685 -1.211514661             third_quantil
## 191 -2.5973701  1.225700683           premier_quantil
## 196 -2.8588061  1.707946563           premier_quantil
## 210 -1.9930583 -1.189720536            second_quantil
## 218 -1.6566282 -1.117301292             third_quantil
## 225 -0.7268859 -0.835236102             third_quantil
## 226 -0.6633513 -1.179021989             third_quantil
## 229 -1.1730584 -0.399556148             third_quantil
## 232 -0.7991376 -0.255471089             third_quantil
## 234 -0.7637663 -1.026402806             third_quantil
## 245 -1.1156092  0.873602233            second_quantil
## 249 -1.4723955  0.852167517            second_quantil
## 256 -4.2497725  2.545835892           premier_quantil
## 268 -1.8905155 -1.274270264             third_quantil
## 272 -3.2218642  0.838145928           premier_quantil
## 284 -4.6143460  1.573241774           premier_quantil
## 285 -4.2209652  2.403817499           premier_quantil
## 288 -2.1702161  2.212507868           premier_quantil
## 304 -1.8587022  1.431151136           premier_quantil
## 308 -1.0963826  1.136360108           premier_quantil
## 310 -2.0673156 -0.553660642            second_quantil
## 312 -2.3588626 -0.235652464            second_quantil
## 326 -2.5185568  0.241555955            second_quantil
## 328 -2.0504772 -0.182211776            second_quantil
## 338 -1.9274134 -0.158499006            second_quantil
## 350 -4.1510182  0.873990045           premier_quantil
## 351 -4.0497266  0.995543314           premier_quantil
## 355 -2.7916212  2.766354358           premier_quantil
## 358  6.1825650 -0.836499387            fourth_quantil
## 367  6.9557388 -0.061192076            fourth_quantil
## 371  6.1538773 -0.737819596            fourth_quantil
## 373  6.5663619 -1.096027676            fourth_quantil
## 377  6.6204308  0.200138729            fourth_quantil
## 380  6.6120904  0.407812674            fourth_quantil
## 381  6.2930988  0.599486138            fourth_quantil
## 385  7.2695019  0.097548262            fourth_quantil
## 387  7.0960326  0.018298618            fourth_quantil
## 397  6.5453756  0.227228952            fourth_quantil
## 410  6.9330625  0.130268208            fourth_quantil
## 411  6.9725952  1.202108257            fourth_quantil
## 412  7.0569165  0.569561509            fourth_quantil
## 413  7.7509117  0.137456881            fourth_quantil
## 423  6.6198484  0.708563545            fourth_quantil
## 424  7.0644996  0.659906485            fourth_quantil
## 427  6.5551119  1.583621100            fourth_quantil
## 431  6.7293783  1.030784929            fourth_quantil
## 433  6.4745453  1.259962518            fourth_quantil
## 435  6.9353222  0.244550044            fourth_quantil
## 437  7.0841354  0.047739915            fourth_quantil
## 438  7.4187787 -0.279592429            fourth_quantil
## 441  6.7356991 -0.077765024            fourth_quantil
## 443  6.7194681 -0.341344771            fourth_quantil
## 444  6.6920760 -0.304925027            fourth_quantil
## 446  7.2518690 -0.311106112            fourth_quantil
## 447  6.7547338 -0.209276638            fourth_quantil
## 448  6.5858542 -0.047964040            fourth_quantil
## 453  6.5581085 -0.033751653            fourth_quantil
## 460  6.4962033 -0.073509396            fourth_quantil
## 464  6.3691960  0.010711213            fourth_quantil
## 471  6.2523209  0.800009717            fourth_quantil
## 484  5.6806078  1.592947743            fourth_quantil
## 491 -0.9753120 -1.722043654            second_quantil
## 494 -1.2172576 -0.513483292             third_quantil
## 497 -0.8234830 -0.967500760             third_quantil
## 499 -1.1581295 -0.643361810             third_quantil
## 501 -1.0697983 -0.621807365             third_quantil
# Create a scatterplot of LD1 and LD2
plot_LDA <- ggplot(plot_data, aes(x = LD1, y = LD2, color = prediction_crime_quantile)) +
  geom_point() +
  labs(title = "LDA Biplot with Predicted Crime Quantiles")

plot_LDA

# adding real values - comparison of actual vs predicted values in test_data
realVsPred_plot <- plot_LDA + 
  geom_point(aes(color = actual_crime_categories), size = 4, alpha = 0.1) +
  labs(color = "Real Quantiles of Crime")

realVsPred_plot

# the accuracy of predictions using test data
accuracy <- mean(lda_pred == test_data$quantiles_crime)
print(paste("Accuracy of LDA model on test data:", round(accuracy * 100, 2), "%"))
## [1] "Accuracy of LDA model on test data: 77.23 %"

Question 6:

Save the crime categories from the test set and then remove the categorical crime variable from the test dataset. Then predict the classes with the LDA model on the test data. Cross tabulate the results with the crime categories from the test set. Comment on the results. (0-3 points)

# save the crime data: 
actual_crime_categories <- test_data$quantiles_crime

# Remove the categorical crime variable from the test dataset
test_data_without_crime <- test_data %>% dplyr::select(-quantiles_crime)

# get the classes based on the model - this was already done earlier with lda_pred, so I am a bit confused by the instructions.
lda_pred_test <- predict(lda_model, test_data_without_crime)$class

# cross-tabulate the predictions vs. the actual values - the results are identical between the two approaches since I performed the same steps twice. I might have missed something in the instructions.
cross_tab <- table(Predicted = lda_pred_test, Actual = actual_crime_categories)
cross_tab
##                  Actual
## Predicted         premier_quantil second_quantil third_quantil fourth_quantil
##   premier_quantil              15              3             0              0
##   second_quantil                8             14             4              0
##   third_quantil                 2              4            17              1
##   fourth_quantil                0              0             1             32
cross_table <- table(Predicted = lda_pred, Actual = actual_crime_categories)
cross_table
##                  Actual
## Predicted         premier_quantil second_quantil third_quantil fourth_quantil
##   premier_quantil              15              3             0              0
##   second_quantil                8             14             4              0
##   third_quantil                 2              4            17              1
##   fourth_quantil                0              0             1             32
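The overall accuracy reported earlier can be recovered directly from this cross-tabulation, and the table also gives a per-class accuracy. A small sketch (my own addition, reusing cross_tab from above):

```r
# Overall accuracy: correctly classified counts (the diagonal) over the total.
accuracy_from_table <- sum(diag(cross_tab)) / sum(cross_tab)
accuracy_from_table  # 0.7723, matching the 77.23 % reported earlier

# Per-class accuracy: correct predictions divided by the actual count per class.
diag(cross_tab) / colSums(cross_tab)
```

The per-class view shows the model does best on the fourth (highest-crime) quantile and confuses the two middle quantiles the most.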

Question 7:

Reload the Boston dataset and standardize the dataset (we did not do this in the Exercise Set, but you should scale the variables to get comparable distances). Calculate the distances between the observations. Run k-means algorithm on the dataset. Investigate what is the optimal number of clusters and run the algorithm again. Visualize the clusters (for example with the pairs() or ggpairs() functions, where the clusters are separated with colors) and interpret the results. (0-4 points)

# Reload the Boston dataset
Boston_load <- Boston

# scale of Boston
Boston_scaled <- scale(Boston_load)

# Calculate the distances between the scaled data points:
distances <- dist(Boston_scaled)

# Visualize the distances using fviz_dist()
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
fviz_dist(distances, gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))

# k-means: the algorithm groups observations into clusters based on their similarity, aiming to minimize the variation within clusters while maximizing the variation between clusters. We need to choose the number of clusters before running the k-means analysis.

# Elbow method to find the optimal number of clusters: calculate the Within-Cluster Sum of Squared Errors (WSS) for different values of k, and choose the k at which the decrease in WSS first starts to level off. In the plot of WSS versus k, this is visible as an elbow.
# The Squared Error for each point is the square of the distance of the point from its representation i.e. its predicted cluster center.
# The WSS score is the sum of these Squared Errors for all the points.
wss <- numeric(10)  # Initialize within-cluster sum of squares vector
for (i in 1:10) {
  kmeans_model <- kmeans(Boston_scaled, centers = i, nstart = 10)
  wss[i] <- kmeans_model$tot.withinss  # Store within-cluster sum of squares
}

# Plotting the Elbow Method
plot(1:10, wss, type = "b", xlab = "Number of Clusters", ylab = "Within-cluster Sum of Squares", main = "Elbow Method")
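What tot.withinss actually measures can be verified by hand: it is the total squared distance from each observation to its assigned cluster center. A minimal sketch (my own addition; k = 2 and seed 42 are arbitrary here):

```r
library(MASS)  # Boston data

Boston_scaled <- scale(Boston)
set.seed(42)
km <- kmeans(Boston_scaled, centers = 2, nstart = 10)

# Expand the 2 x 14 center matrix to one row per observation, then sum the
# squared differences between each point and its own cluster center.
centers_per_point <- km$centers[km$cluster, ]
wss_manual <- sum((Boston_scaled - centers_per_point)^2)

all.equal(wss_manual, km$tot.withinss)  # TRUE
```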

# or we can use fviz_nbclust() directly for the Elbow method
kmeans_model_optimal2 <- fviz_nbclust(Boston_scaled, kmeans, method = "wss")
# or we can use the silhouette method
fviz_nbclust(Boston_scaled, kmeans, method = "silhouette")

# Visualize clusters using pairs() - note that at this point kmeans_model is still the last model from the loop above (k = 10)
pairs(Boston_scaled, col = kmeans_model$cluster)

k <- 2  # 2 seems to be the best option according to the Elbow Method and the silhouette method 

#Kmean model:
kmeans_model <- kmeans(Boston_scaled, centers = k, nstart = 25)

cluster_assignments <- kmeans_model$cluster

# visualize the clusters with the fviz_cluster() function
fviz_cluster(kmeans_model, data = Boston_scaled)

library(GGally)

# Combine the scaled data with the cluster assignments (cluster 1 or 2)
clustered_data <- cbind(as.data.frame(Boston_scaled), Cluster = as.factor(cluster_assignments))

# Visualize clusters using ggpairs()
ggpairs(clustered_data, aes(color = Cluster))
## Warning in cor(x, y): the standard deviation is zero
## (the same warning is repeated 13 times in total)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## (the same message is repeated 14 times in total)

# Mean for each cluster and variable:
clustered_data %>%
  mutate(Cluster = kmeans_model$cluster) %>%
  group_by(Cluster) %>%
  summarise_all("mean")
## # A tibble: 2 × 15
##   Cluster   crim     zn  indus     chas    nox     rm    age    dis    rad
##     <int>  <dbl>  <dbl>  <dbl>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1       1 -0.389  0.262 -0.615  0.00291 -0.582  0.245 -0.433  0.454 -0.583
## 2       2  0.724 -0.487  1.14  -0.00541  1.08  -0.455  0.805 -0.844  1.08 
## # ℹ 5 more variables: tax <dbl>, ptratio <dbl>, black <dbl>, lstat <dbl>,
## #   medv <dbl>
model_predictors <- train_data %>% dplyr::select(-quantiles_crime)

# check the dimensions
dim(model_predictors)
## [1] 405  13
dim(lda_model$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda_model$scaling
matrix_product <- as.data.frame(matrix_product)
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ordinal':
## 
##     slice
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')

Chapter 5

Question 1

Move the country names to rownames (see Exercise 5.5). Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them. (0-3 points)

library(tibble)
library(readr)
library(corrplot)
## corrplot 0.92 loaded
library(ggplot2)
library(tidyr)
library(GGally)

human_new <- read_csv("/Users/margot/Desktop/Desktop - MacBook Pro de MARGOT/Open data with R 2023/IODS-project/human_new.csv", show_col_types = FALSE)

str(human_new)
## spc_tbl_ [155 × 19] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ HDI.Rank      : num [1:155] 1 2 3 4 5 6 6 8 9 9 ...
##  $ Country       : chr [1:155] "Norway" "Australia" "Switzerland" "Denmark" ...
##  $ HDI           : num [1:155] 0.944 0.935 0.93 0.923 0.922 0.916 0.916 0.915 0.913 0.913 ...
##  $ Life.Exp      : num [1:155] 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ Edu.Exp       : num [1:155] 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ Edu.Mean      : num [1:155] 12.6 13 12.8 12.7 11.9 13.1 12.2 12.9 13 12.5 ...
##  $ GNI           : num [1:155] 64992 42261 56431 44025 45435 ...
##  $ GNI.Minus.Rank: num [1:155] 5 17 6 11 9 11 16 3 11 23 ...
##  $ GII.Rank      : num [1:155] 1 2 3 4 5 6 6 8 9 9 ...
##  $ GII           : num [1:155] 0.067 0.11 0.028 0.048 0.062 0.041 0.113 0.28 0.129 0.157 ...
##  $ Mat.Mor       : num [1:155] 4 6 6 5 6 7 9 28 11 8 ...
##  $ Ado.Birth     : num [1:155] 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ Parli.F       : num [1:155] 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
##  $ Edu2.F        : num [1:155] 97.4 94.3 95 95.5 87.7 96.3 80.5 95.1 100 95 ...
##  $ Edu2.M        : num [1:155] 96.7 94.6 96.6 96.6 90.5 97 78.6 94.8 100 95.3 ...
##  $ Labo.F        : num [1:155] 61.2 58.8 61.8 58.7 58.5 53.6 53.1 56.3 61.6 62 ...
##  $ Labo.M        : num [1:155] 68.7 71.8 74.9 66.4 70.6 66.4 68.1 68.9 71 73.8 ...
##  $ Edu2.FM       : num [1:155] 1.007 0.997 0.983 0.989 0.969 ...
##  $ Labo.FM       : num [1:155] 0.891 0.819 0.825 0.884 0.829 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   HDI.Rank = col_double(),
##   ..   Country = col_character(),
##   ..   HDI = col_double(),
##   ..   Life.Exp = col_double(),
##   ..   Edu.Exp = col_double(),
##   ..   Edu.Mean = col_double(),
##   ..   GNI = col_double(),
##   ..   GNI.Minus.Rank = col_double(),
##   ..   GII.Rank = col_double(),
##   ..   GII = col_double(),
##   ..   Mat.Mor = col_double(),
##   ..   Ado.Birth = col_double(),
##   ..   Parli.F = col_double(),
##   ..   Edu2.F = col_double(),
##   ..   Edu2.M = col_double(),
##   ..   Labo.F = col_double(),
##   ..   Labo.M = col_double(),
##   ..   Edu2.FM = col_double(),
##   ..   Labo.FM = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
# move the Country column to rownames
human_new <- column_to_rownames(human_new, "Country")

# check for missing values: printing the full is.na() logical matrix is
# unreadable, so count the missing values directly instead
sum(is.na(human_new))
## [1] 0
# no missing values remain: they were removed during the data wrangling step
## Paraguay                                     FALSE FALSE   FALSE     FALSE
## Philippines                                  FALSE FALSE   FALSE     FALSE
## El Salvador                                  FALSE FALSE   FALSE     FALSE
## South Africa                                 FALSE FALSE   FALSE     FALSE
## Viet Nam                                     FALSE FALSE   FALSE     FALSE
## Bolivia (Plurinational State of)             FALSE FALSE   FALSE     FALSE
## Kyrgyzstan                                   FALSE FALSE   FALSE     FALSE
## Iraq                                         FALSE FALSE   FALSE     FALSE
## Guyana                                       FALSE FALSE   FALSE     FALSE
## Nicaragua                                    FALSE FALSE   FALSE     FALSE
## Morocco                                      FALSE FALSE   FALSE     FALSE
## Namibia                                      FALSE FALSE   FALSE     FALSE
## Guatemala                                    FALSE FALSE   FALSE     FALSE
## Tajikistan                                   FALSE FALSE   FALSE     FALSE
## India                                        FALSE FALSE   FALSE     FALSE
## Honduras                                     FALSE FALSE   FALSE     FALSE
## Bhutan                                       FALSE FALSE   FALSE     FALSE
## Syrian Arab Republic                         FALSE FALSE   FALSE     FALSE
## Congo                                        FALSE FALSE   FALSE     FALSE
## Zambia                                       FALSE FALSE   FALSE     FALSE
## Ghana                                        FALSE FALSE   FALSE     FALSE
## Bangladesh                                   FALSE FALSE   FALSE     FALSE
## Cambodia                                     FALSE FALSE   FALSE     FALSE
## Kenya                                        FALSE FALSE   FALSE     FALSE
## Nepal                                        FALSE FALSE   FALSE     FALSE
## Pakistan                                     FALSE FALSE   FALSE     FALSE
## Myanmar                                      FALSE FALSE   FALSE     FALSE
## Swaziland                                    FALSE FALSE   FALSE     FALSE
## Tanzania (United Republic of)                FALSE FALSE   FALSE     FALSE
## Cameroon                                     FALSE FALSE   FALSE     FALSE
## Zimbabwe                                     FALSE FALSE   FALSE     FALSE
## Mauritania                                   FALSE FALSE   FALSE     FALSE
## Papua New Guinea                             FALSE FALSE   FALSE     FALSE
## Yemen                                        FALSE FALSE   FALSE     FALSE
## Lesotho                                      FALSE FALSE   FALSE     FALSE
## Togo                                         FALSE FALSE   FALSE     FALSE
## Haiti                                        FALSE FALSE   FALSE     FALSE
## Rwanda                                       FALSE FALSE   FALSE     FALSE
## Uganda                                       FALSE FALSE   FALSE     FALSE
## Benin                                        FALSE FALSE   FALSE     FALSE
## Sudan                                        FALSE FALSE   FALSE     FALSE
## Senegal                                      FALSE FALSE   FALSE     FALSE
## Afghanistan                                  FALSE FALSE   FALSE     FALSE
## Côte d'Ivoire                                FALSE FALSE   FALSE     FALSE
## Malawi                                       FALSE FALSE   FALSE     FALSE
## Ethiopia                                     FALSE FALSE   FALSE     FALSE
## Gambia                                       FALSE FALSE   FALSE     FALSE
## Congo (Democratic Republic of the)           FALSE FALSE   FALSE     FALSE
## Liberia                                      FALSE FALSE   FALSE     FALSE
## Mali                                         FALSE FALSE   FALSE     FALSE
## Mozambique                                   FALSE FALSE   FALSE     FALSE
## Sierra Leone                                 FALSE FALSE   FALSE     FALSE
## Burkina Faso                                 FALSE FALSE   FALSE     FALSE
## Burundi                                      FALSE FALSE   FALSE     FALSE
## Chad                                         FALSE FALSE   FALSE     FALSE
## Central African Republic                     FALSE FALSE   FALSE     FALSE
## Niger                                        FALSE FALSE   FALSE     FALSE
##                                           Parli.F Edu2.F Edu2.M Labo.F Labo.M
## Norway                                      FALSE  FALSE  FALSE  FALSE  FALSE
## Australia                                   FALSE  FALSE  FALSE  FALSE  FALSE
## Switzerland                                 FALSE  FALSE  FALSE  FALSE  FALSE
## Denmark                                     FALSE  FALSE  FALSE  FALSE  FALSE
## Netherlands                                 FALSE  FALSE  FALSE  FALSE  FALSE
## Germany                                     FALSE  FALSE  FALSE  FALSE  FALSE
## Ireland                                     FALSE  FALSE  FALSE  FALSE  FALSE
## United States                               FALSE  FALSE  FALSE  FALSE  FALSE
## Canada                                      FALSE  FALSE  FALSE  FALSE  FALSE
## New Zealand                                 FALSE  FALSE  FALSE  FALSE  FALSE
## Singapore                                   FALSE  FALSE  FALSE  FALSE  FALSE
## Sweden                                      FALSE  FALSE  FALSE  FALSE  FALSE
## United Kingdom                              FALSE  FALSE  FALSE  FALSE  FALSE
## Iceland                                     FALSE  FALSE  FALSE  FALSE  FALSE
## Korea (Republic of)                         FALSE  FALSE  FALSE  FALSE  FALSE
## Israel                                      FALSE  FALSE  FALSE  FALSE  FALSE
## Luxembourg                                  FALSE  FALSE  FALSE  FALSE  FALSE
## Japan                                       FALSE  FALSE  FALSE  FALSE  FALSE
## Belgium                                     FALSE  FALSE  FALSE  FALSE  FALSE
## France                                      FALSE  FALSE  FALSE  FALSE  FALSE
## Austria                                     FALSE  FALSE  FALSE  FALSE  FALSE
## Finland                                     FALSE  FALSE  FALSE  FALSE  FALSE
## Slovenia                                    FALSE  FALSE  FALSE  FALSE  FALSE
## Spain                                       FALSE  FALSE  FALSE  FALSE  FALSE
## Italy                                       FALSE  FALSE  FALSE  FALSE  FALSE
## Czech Republic                              FALSE  FALSE  FALSE  FALSE  FALSE
## Greece                                      FALSE  FALSE  FALSE  FALSE  FALSE
## Estonia                                     FALSE  FALSE  FALSE  FALSE  FALSE
## Cyprus                                      FALSE  FALSE  FALSE  FALSE  FALSE
## Qatar                                       FALSE  FALSE  FALSE  FALSE  FALSE
## Slovakia                                    FALSE  FALSE  FALSE  FALSE  FALSE
## Poland                                      FALSE  FALSE  FALSE  FALSE  FALSE
## Lithuania                                   FALSE  FALSE  FALSE  FALSE  FALSE
## Malta                                       FALSE  FALSE  FALSE  FALSE  FALSE
## Saudi Arabia                                FALSE  FALSE  FALSE  FALSE  FALSE
## Argentina                                   FALSE  FALSE  FALSE  FALSE  FALSE
## United Arab Emirates                        FALSE  FALSE  FALSE  FALSE  FALSE
## Chile                                       FALSE  FALSE  FALSE  FALSE  FALSE
## Portugal                                    FALSE  FALSE  FALSE  FALSE  FALSE
## Hungary                                     FALSE  FALSE  FALSE  FALSE  FALSE
## Bahrain                                     FALSE  FALSE  FALSE  FALSE  FALSE
## Latvia                                      FALSE  FALSE  FALSE  FALSE  FALSE
## Croatia                                     FALSE  FALSE  FALSE  FALSE  FALSE
## Kuwait                                      FALSE  FALSE  FALSE  FALSE  FALSE
## Montenegro                                  FALSE  FALSE  FALSE  FALSE  FALSE
## Belarus                                     FALSE  FALSE  FALSE  FALSE  FALSE
## Russian Federation                          FALSE  FALSE  FALSE  FALSE  FALSE
## Oman                                        FALSE  FALSE  FALSE  FALSE  FALSE
## Romania                                     FALSE  FALSE  FALSE  FALSE  FALSE
## Uruguay                                     FALSE  FALSE  FALSE  FALSE  FALSE
## Bahamas                                     FALSE  FALSE  FALSE  FALSE  FALSE
## Kazakhstan                                  FALSE  FALSE  FALSE  FALSE  FALSE
## Barbados                                    FALSE  FALSE  FALSE  FALSE  FALSE
## Bulgaria                                    FALSE  FALSE  FALSE  FALSE  FALSE
## Panama                                      FALSE  FALSE  FALSE  FALSE  FALSE
## Malaysia                                    FALSE  FALSE  FALSE  FALSE  FALSE
## Mauritius                                   FALSE  FALSE  FALSE  FALSE  FALSE
## Trinidad and Tobago                         FALSE  FALSE  FALSE  FALSE  FALSE
## Serbia                                      FALSE  FALSE  FALSE  FALSE  FALSE
## Cuba                                        FALSE  FALSE  FALSE  FALSE  FALSE
## Lebanon                                     FALSE  FALSE  FALSE  FALSE  FALSE
## Costa Rica                                  FALSE  FALSE  FALSE  FALSE  FALSE
## Iran (Islamic Republic of)                  FALSE  FALSE  FALSE  FALSE  FALSE
## Venezuela (Bolivarian Republic of)          FALSE  FALSE  FALSE  FALSE  FALSE
## Turkey                                      FALSE  FALSE  FALSE  FALSE  FALSE
## Sri Lanka                                   FALSE  FALSE  FALSE  FALSE  FALSE
## Mexico                                      FALSE  FALSE  FALSE  FALSE  FALSE
## Brazil                                      FALSE  FALSE  FALSE  FALSE  FALSE
## Georgia                                     FALSE  FALSE  FALSE  FALSE  FALSE
## Azerbaijan                                  FALSE  FALSE  FALSE  FALSE  FALSE
## Jordan                                      FALSE  FALSE  FALSE  FALSE  FALSE
## The former Yugoslav Republic of Macedonia   FALSE  FALSE  FALSE  FALSE  FALSE
## Ukraine                                     FALSE  FALSE  FALSE  FALSE  FALSE
## Algeria                                     FALSE  FALSE  FALSE  FALSE  FALSE
## Peru                                        FALSE  FALSE  FALSE  FALSE  FALSE
## Albania                                     FALSE  FALSE  FALSE  FALSE  FALSE
## Armenia                                     FALSE  FALSE  FALSE  FALSE  FALSE
## Bosnia and Herzegovina                      FALSE  FALSE  FALSE  FALSE  FALSE
## Ecuador                                     FALSE  FALSE  FALSE  FALSE  FALSE
## China                                       FALSE  FALSE  FALSE  FALSE  FALSE
## Fiji                                        FALSE  FALSE  FALSE  FALSE  FALSE
## Mongolia                                    FALSE  FALSE  FALSE  FALSE  FALSE
## Thailand                                    FALSE  FALSE  FALSE  FALSE  FALSE
## Libya                                       FALSE  FALSE  FALSE  FALSE  FALSE
## Tunisia                                     FALSE  FALSE  FALSE  FALSE  FALSE
## Colombia                                    FALSE  FALSE  FALSE  FALSE  FALSE
## Jamaica                                     FALSE  FALSE  FALSE  FALSE  FALSE
## Tonga                                       FALSE  FALSE  FALSE  FALSE  FALSE
## Belize                                      FALSE  FALSE  FALSE  FALSE  FALSE
## Dominican Republic                          FALSE  FALSE  FALSE  FALSE  FALSE
## Suriname                                    FALSE  FALSE  FALSE  FALSE  FALSE
## Maldives                                    FALSE  FALSE  FALSE  FALSE  FALSE
## Samoa                                       FALSE  FALSE  FALSE  FALSE  FALSE
## Botswana                                    FALSE  FALSE  FALSE  FALSE  FALSE
## Moldova (Republic of)                       FALSE  FALSE  FALSE  FALSE  FALSE
## Egypt                                       FALSE  FALSE  FALSE  FALSE  FALSE
## Gabon                                       FALSE  FALSE  FALSE  FALSE  FALSE
## Indonesia                                   FALSE  FALSE  FALSE  FALSE  FALSE
## Paraguay                                    FALSE  FALSE  FALSE  FALSE  FALSE
## Philippines                                 FALSE  FALSE  FALSE  FALSE  FALSE
## El Salvador                                 FALSE  FALSE  FALSE  FALSE  FALSE
## South Africa                                FALSE  FALSE  FALSE  FALSE  FALSE
## Viet Nam                                    FALSE  FALSE  FALSE  FALSE  FALSE
## Bolivia (Plurinational State of)            FALSE  FALSE  FALSE  FALSE  FALSE
## Kyrgyzstan                                  FALSE  FALSE  FALSE  FALSE  FALSE
## Iraq                                        FALSE  FALSE  FALSE  FALSE  FALSE
## Guyana                                      FALSE  FALSE  FALSE  FALSE  FALSE
## Nicaragua                                   FALSE  FALSE  FALSE  FALSE  FALSE
## Morocco                                     FALSE  FALSE  FALSE  FALSE  FALSE
## Namibia                                     FALSE  FALSE  FALSE  FALSE  FALSE
## Guatemala                                   FALSE  FALSE  FALSE  FALSE  FALSE
## Tajikistan                                  FALSE  FALSE  FALSE  FALSE  FALSE
## India                                       FALSE  FALSE  FALSE  FALSE  FALSE
## Honduras                                    FALSE  FALSE  FALSE  FALSE  FALSE
## Bhutan                                      FALSE  FALSE  FALSE  FALSE  FALSE
## Syrian Arab Republic                        FALSE  FALSE  FALSE  FALSE  FALSE
## Congo                                       FALSE  FALSE  FALSE  FALSE  FALSE
## Zambia                                      FALSE  FALSE  FALSE  FALSE  FALSE
## Ghana                                       FALSE  FALSE  FALSE  FALSE  FALSE
## Bangladesh                                  FALSE  FALSE  FALSE  FALSE  FALSE
## Cambodia                                    FALSE  FALSE  FALSE  FALSE  FALSE
## Kenya                                       FALSE  FALSE  FALSE  FALSE  FALSE
## Nepal                                       FALSE  FALSE  FALSE  FALSE  FALSE
## Pakistan                                    FALSE  FALSE  FALSE  FALSE  FALSE
## Myanmar                                     FALSE  FALSE  FALSE  FALSE  FALSE
## Swaziland                                   FALSE  FALSE  FALSE  FALSE  FALSE
## Tanzania (United Republic of)               FALSE  FALSE  FALSE  FALSE  FALSE
## Cameroon                                    FALSE  FALSE  FALSE  FALSE  FALSE
## Zimbabwe                                    FALSE  FALSE  FALSE  FALSE  FALSE
## Mauritania                                  FALSE  FALSE  FALSE  FALSE  FALSE
## Papua New Guinea                            FALSE  FALSE  FALSE  FALSE  FALSE
## Yemen                                       FALSE  FALSE  FALSE  FALSE  FALSE
## Lesotho                                     FALSE  FALSE  FALSE  FALSE  FALSE
## Togo                                        FALSE  FALSE  FALSE  FALSE  FALSE
## Haiti                                       FALSE  FALSE  FALSE  FALSE  FALSE
## Rwanda                                      FALSE  FALSE  FALSE  FALSE  FALSE
## Uganda                                      FALSE  FALSE  FALSE  FALSE  FALSE
## Benin                                       FALSE  FALSE  FALSE  FALSE  FALSE
## Sudan                                       FALSE  FALSE  FALSE  FALSE  FALSE
## Senegal                                     FALSE  FALSE  FALSE  FALSE  FALSE
## Afghanistan                                 FALSE  FALSE  FALSE  FALSE  FALSE
## Côte d'Ivoire                               FALSE  FALSE  FALSE  FALSE  FALSE
## Malawi                                      FALSE  FALSE  FALSE  FALSE  FALSE
## Ethiopia                                    FALSE  FALSE  FALSE  FALSE  FALSE
## Gambia                                      FALSE  FALSE  FALSE  FALSE  FALSE
## Congo (Democratic Republic of the)          FALSE  FALSE  FALSE  FALSE  FALSE
## Liberia                                     FALSE  FALSE  FALSE  FALSE  FALSE
## Mali                                        FALSE  FALSE  FALSE  FALSE  FALSE
## Mozambique                                  FALSE  FALSE  FALSE  FALSE  FALSE
## Sierra Leone                                FALSE  FALSE  FALSE  FALSE  FALSE
## Burkina Faso                                FALSE  FALSE  FALSE  FALSE  FALSE
## Burundi                                     FALSE  FALSE  FALSE  FALSE  FALSE
## Chad                                        FALSE  FALSE  FALSE  FALSE  FALSE
## Central African Republic                    FALSE  FALSE  FALSE  FALSE  FALSE
## Niger                                       FALSE  FALSE  FALSE  FALSE  FALSE
##                                           Edu2.FM Labo.FM
## Norway                                      FALSE   FALSE
## Australia                                   FALSE   FALSE
## Switzerland                                 FALSE   FALSE
## Denmark                                     FALSE   FALSE
## Netherlands                                 FALSE   FALSE
## Germany                                     FALSE   FALSE
## Ireland                                     FALSE   FALSE
## United States                               FALSE   FALSE
## Canada                                      FALSE   FALSE
## New Zealand                                 FALSE   FALSE
## Singapore                                   FALSE   FALSE
## Sweden                                      FALSE   FALSE
## United Kingdom                              FALSE   FALSE
## Iceland                                     FALSE   FALSE
## Korea (Republic of)                         FALSE   FALSE
## Israel                                      FALSE   FALSE
## Luxembourg                                  FALSE   FALSE
## Japan                                       FALSE   FALSE
## Belgium                                     FALSE   FALSE
## France                                      FALSE   FALSE
## Austria                                     FALSE   FALSE
## Finland                                     FALSE   FALSE
## Slovenia                                    FALSE   FALSE
## Spain                                       FALSE   FALSE
## Italy                                       FALSE   FALSE
## Czech Republic                              FALSE   FALSE
## Greece                                      FALSE   FALSE
## Estonia                                     FALSE   FALSE
## Cyprus                                      FALSE   FALSE
## Qatar                                       FALSE   FALSE
## Slovakia                                    FALSE   FALSE
## Poland                                      FALSE   FALSE
## Lithuania                                   FALSE   FALSE
## Malta                                       FALSE   FALSE
## Saudi Arabia                                FALSE   FALSE
## Argentina                                   FALSE   FALSE
## United Arab Emirates                        FALSE   FALSE
## Chile                                       FALSE   FALSE
## Portugal                                    FALSE   FALSE
## Hungary                                     FALSE   FALSE
## Bahrain                                     FALSE   FALSE
## Latvia                                      FALSE   FALSE
## Croatia                                     FALSE   FALSE
## Kuwait                                      FALSE   FALSE
## Montenegro                                  FALSE   FALSE
## Belarus                                     FALSE   FALSE
## Russian Federation                          FALSE   FALSE
## Oman                                        FALSE   FALSE
## Romania                                     FALSE   FALSE
## Uruguay                                     FALSE   FALSE
## Bahamas                                     FALSE   FALSE
## Kazakhstan                                  FALSE   FALSE
## Barbados                                    FALSE   FALSE
## Bulgaria                                    FALSE   FALSE
## Panama                                      FALSE   FALSE
## Malaysia                                    FALSE   FALSE
## Mauritius                                   FALSE   FALSE
## Trinidad and Tobago                         FALSE   FALSE
## Serbia                                      FALSE   FALSE
## Cuba                                        FALSE   FALSE
## Lebanon                                     FALSE   FALSE
## Costa Rica                                  FALSE   FALSE
## Iran (Islamic Republic of)                  FALSE   FALSE
## Venezuela (Bolivarian Republic of)          FALSE   FALSE
## Turkey                                      FALSE   FALSE
## Sri Lanka                                   FALSE   FALSE
## Mexico                                      FALSE   FALSE
## Brazil                                      FALSE   FALSE
## Georgia                                     FALSE   FALSE
## Azerbaijan                                  FALSE   FALSE
## Jordan                                      FALSE   FALSE
## The former Yugoslav Republic of Macedonia   FALSE   FALSE
## Ukraine                                     FALSE   FALSE
## Algeria                                     FALSE   FALSE
## Peru                                        FALSE   FALSE
## Albania                                     FALSE   FALSE
## Armenia                                     FALSE   FALSE
## Bosnia and Herzegovina                      FALSE   FALSE
## Ecuador                                     FALSE   FALSE
## China                                       FALSE   FALSE
## Fiji                                        FALSE   FALSE
## Mongolia                                    FALSE   FALSE
## Thailand                                    FALSE   FALSE
## Libya                                       FALSE   FALSE
## Tunisia                                     FALSE   FALSE
## Colombia                                    FALSE   FALSE
## Jamaica                                     FALSE   FALSE
## Tonga                                       FALSE   FALSE
## Belize                                      FALSE   FALSE
## Dominican Republic                          FALSE   FALSE
## Suriname                                    FALSE   FALSE
## Maldives                                    FALSE   FALSE
## Samoa                                       FALSE   FALSE
## Botswana                                    FALSE   FALSE
## Moldova (Republic of)                       FALSE   FALSE
## Egypt                                       FALSE   FALSE
## Gabon                                       FALSE   FALSE
## Indonesia                                   FALSE   FALSE
## Paraguay                                    FALSE   FALSE
## Philippines                                 FALSE   FALSE
## El Salvador                                 FALSE   FALSE
## South Africa                                FALSE   FALSE
## Viet Nam                                    FALSE   FALSE
## Bolivia (Plurinational State of)            FALSE   FALSE
## Kyrgyzstan                                  FALSE   FALSE
## Iraq                                        FALSE   FALSE
## Guyana                                      FALSE   FALSE
## Nicaragua                                   FALSE   FALSE
## Morocco                                     FALSE   FALSE
## Namibia                                     FALSE   FALSE
## Guatemala                                   FALSE   FALSE
## Tajikistan                                  FALSE   FALSE
## India                                       FALSE   FALSE
## Honduras                                    FALSE   FALSE
## Bhutan                                      FALSE   FALSE
## Syrian Arab Republic                        FALSE   FALSE
## Congo                                       FALSE   FALSE
## Zambia                                      FALSE   FALSE
## Ghana                                       FALSE   FALSE
## Bangladesh                                  FALSE   FALSE
## Cambodia                                    FALSE   FALSE
## Kenya                                       FALSE   FALSE
## Nepal                                       FALSE   FALSE
## Pakistan                                    FALSE   FALSE
## Myanmar                                     FALSE   FALSE
## Swaziland                                   FALSE   FALSE
## Tanzania (United Republic of)               FALSE   FALSE
## Cameroon                                    FALSE   FALSE
## Zimbabwe                                    FALSE   FALSE
## Mauritania                                  FALSE   FALSE
## Papua New Guinea                            FALSE   FALSE
## Yemen                                       FALSE   FALSE
## Lesotho                                     FALSE   FALSE
## Togo                                        FALSE   FALSE
## Haiti                                       FALSE   FALSE
## Rwanda                                      FALSE   FALSE
## Uganda                                      FALSE   FALSE
## Benin                                       FALSE   FALSE
## Sudan                                       FALSE   FALSE
## Senegal                                     FALSE   FALSE
## Afghanistan                                 FALSE   FALSE
## Côte d'Ivoire                               FALSE   FALSE
## Malawi                                      FALSE   FALSE
## Ethiopia                                    FALSE   FALSE
## Gambia                                      FALSE   FALSE
## Congo (Democratic Republic of the)          FALSE   FALSE
## Liberia                                     FALSE   FALSE
## Mali                                        FALSE   FALSE
## Mozambique                                  FALSE   FALSE
## Sierra Leone                                FALSE   FALSE
## Burkina Faso                                FALSE   FALSE
## Burundi                                     FALSE   FALSE
## Chad                                        FALSE   FALSE
## Central African Republic                    FALSE   FALSE
## Niger                                       FALSE   FALSE
# histogram per variable
pivot_longer(human_new, cols = everything()) %>% 
  ggplot(aes(value)) + facet_wrap("name", scales = "free") + 
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# pairwise scatter plots and correlations - with this many variables the panels are small and the data is not very visible
ggpairs(human_new,lower=list(combo=wrap("facethist",binwidth=0.1)))

# correlation between variables

c <- cor(human_new)
corrplot(c, method="circle")

HDI, Life.Exp, Edu.Exp and Edu.Mean are negatively correlated with GII, Mat.Mor and Ado.Birth. HDI, Life.Exp, Edu.Exp and Edu.Mean are positively correlated with secondary education of both females and males (Edu2.F, Edu2.M). GII, Ado.Birth and Mat.Mor are positively correlated with each other, while GII and Mat.Mor are negatively correlated with both female and male education.
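These observations can be backed up numerically. A small base-R sketch (assuming the correlation matrix `c` computed above) that lists the strongest pairwise correlations:

```r
# reshape the correlation matrix into a long table of variable pairs,
# then sort the pairs by absolute correlation strength
cor_pairs <- as.data.frame(as.table(c))
names(cor_pairs) <- c("var1", "var2", "cor")
# keep each pair only once (upper triangle, excluding the diagonal)
cor_pairs <- cor_pairs[as.integer(cor_pairs$var1) < as.integer(cor_pairs$var2), ]
# the ten strongest relationships, positive or negative
head(cor_pairs[order(-abs(cor_pairs$cor)), ], 10)
```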

Question 2

Perform principal component analysis (PCA) on the raw (non-standardized) human data. Show the variability captured by the principal components. Draw a biplot displaying the observations by the first two principal components (PC1 coordinate in x-axis, PC2 coordinate in y-axis), along with arrows representing the original variables. (0-2 points)

# pca
pca_human <- prcomp(human_new)

# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2)
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## (the same warning is repeated once per zero-length variable arrow; repeats omitted)
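The question also asks to show the variability captured by the principal components, which the biplot alone does not display. A short sketch (assuming the `pca_human` object from above):

```r
# proportion of variance explained by each principal component, in percent
s <- summary(pca_human)
pve <- round(100 * s$importance["Proportion of Variance", ], 1)
pve
```

With the unscaled data, almost all of the variance is expected to fall on PC1, because GNI dwarfs the other variables in magnitude.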

Question 3

Standardize the variables in the human data and repeat the above analysis. Interpret the results of both analysis (with and without standardizing). Are the results different? Why or why not? Include captions (brief descriptions) in your plots where you describe the results by using not just your variable names, but the actual phenomena they relate to. (0-4 points)

# standardize human_new
scaled_human <- scale(human_new)

# new PCA
pca_human2 <- prcomp(scaled_human)

# draw a biplot of the principal component representation and the original variables with the scaled data
biplot(pca_human2, choices = 1:2)

# Question: How to make it readable? What is expected in this question?

# Interpret the results of both analyses (with and without standardizing). Are the results different? Why or why not?
# Yes, the results are different: without standardizing, each variable has a different weight (scale) in the model, so the variables with the largest variances dominate the principal components. That is why it is necessary to scale / standardize the data before PCA.

# Let's reduce the dataset to only the last 30 countries and a few of the variables.
scaled_human2 <- scaled_human %>% tail(n=30)
scaled_human2 <- as.data.frame(scaled_human2)
scaled_human2 %>% dplyr::select(HDI,GNI,GII,Edu.Exp,Life.Exp,Mat.Mor,Ado.Birth,Edu2.F,Edu2.M)
##                                          HDI        GNI       GII    Edu.Exp
## Swaziland                          -1.116487 -0.6517472 0.9982202 -0.6605505
## Tanzania (United Republic of)      -1.181055 -0.8205902 0.9459891 -1.3999219
## Cameroon                           -1.239166 -0.7994511 1.1549134 -0.9774240
## Zimbabwe                           -1.258537 -0.8635155 0.7213955 -0.8013832
## Mauritania                         -1.277907 -0.7586289 1.2750449 -1.6463790
## Papua New Guinea                   -1.284364 -0.8177860 1.2802680 -1.1534648
## Yemen                              -1.329562 -0.7608399 1.9749414 -1.3999219
## Lesotho                            -1.336019 -0.7723262 0.9146505 -0.7309668
## Togo                               -1.419957 -0.8843849 1.1601365 -0.3436771
## Haiti                              -1.426414 -0.8606034 1.2384832 -1.5759627
## Rwanda                             -1.426414 -0.8719819 0.1781922 -1.0126321
## Uganda                             -1.426414 -0.8636233 0.8989811 -1.1886729
## Benin                              -1.445784 -0.8553187 1.2959373 -0.7309668
## Sudan                              -1.452241 -0.7452013 1.1758059 -2.1745014
## Senegal                            -1.536180 -0.8326157 0.8467501 -1.8576279
## Afghanistan                        -1.542637 -0.8489554 1.7085629 -1.3647137
## Côte d'Ivoire                      -1.562007 -0.7796062 1.6354394 -1.5055464
## Malawi                             -1.671773 -0.9103234 1.2802680 -0.8365913
## Ethiopia                           -1.691143 -0.8735997 1.0034433 -1.6463790
## Gambia                             -1.697600 -0.8693395 1.3377222 -1.5407545
## Congo (Democratic Republic of the) -1.749255 -0.9139365 1.6041007 -1.1886729
## Liberia                            -1.768625 -0.9071957 1.4891923 -1.2942974
## Mali                               -1.839650 -0.8652411 1.6249931 -1.6815871
## Mozambique                         -1.859020 -0.8900472 1.1758059 -1.3647137
## Sierra Leone                       -1.878391 -0.8546176 1.4839692 -1.6111708
## Burkina Faso                       -1.949416 -0.8648097 1.3847302 -1.8928361
## Burundi                            -1.962329 -0.9097302 0.6587182 -1.0830484
## Chad                               -2.013984 -0.8381701 1.7764633 -2.0336687
## Central African Republic           -2.285170 -0.9192752 1.5100848 -2.1040851
## Niger                              -2.298084 -0.9016413 1.8130250 -2.7378319
##                                      Life.Exp   Mat.Mor    Ado.Birth     Edu2.F
## Swaziland                          -2.7188401 0.7597924  0.604233786 -1.0822449
## Tanzania (United Republic of)      -0.7985475 1.2319592  1.837448830 -1.6092561
## Cameroon                           -1.9387212 2.0818593  1.669614830 -1.1016441
## Zimbabwe                           -1.6986846 1.5152592  0.319645699 -0.2157481
## Mauritania                         -1.0265822 0.8070091  0.635854684 -1.5219597
## Papua New Guinea                   -1.0865914 0.3348424  0.363428481 -1.5445921
## Yemen                              -0.9425694 0.5709257 -0.003860417 -1.5122601
## Lesotho                            -2.6228255 1.6096926  1.027467351 -1.0822449
## Togo                               -1.4346444 1.4208259  1.078547264 -1.2697704
## Haiti                              -1.0625877 1.0903092 -0.125479258 -1.0660789
## Rwanda                             -0.8945621 0.8070091 -0.329798910 -1.5316593
## Uganda                             -1.5786664 0.9958758  1.932311526 -1.0499130
## Benin                              -1.4466462 0.9014425  1.046926366 -1.4249638
## Sudan                              -0.9785749 0.9958758  0.896119003 -1.3990982
## Senegal                            -0.6185201 0.8070091  1.149086192 -1.5575249
## Afghanistan                        -1.3506316 1.1847425  0.964225554 -1.5995565
## Côte d'Ivoire                      -2.4187944 2.6956761  2.022309468 -1.3376675
## Malawi                             -1.0625877 1.7041259  2.375004106 -1.4314302
## Ethiopia                           -0.9065639 1.2791758  0.759905902 -1.5381257
## Gambia                             -1.3746353 1.3263925  1.669614830 -1.2277388
## Congo (Democratic Republic of the) -1.5546627 2.7428927  2.143928308 -1.3764659
## Liberia                            -1.2906225 2.3179427  1.708532859 -1.2924027
## Mali                               -1.6386755 1.8929926  3.124176164 -1.5413589
## Mozambique                         -1.9867285 1.5624759  2.204737729 -1.7450503
## Sierra Leone                       -2.4908053 4.4899097  1.302325931 -1.4669954
## Burkina Faso                       -1.5546627 1.1847425  1.659885323 -1.7612163
## Burundi                            -1.7946993 2.7901094 -0.410067345 -1.6189556
## Chad                               -2.4067925 3.9233096  2.550135236 -1.7353507
## Central African Republic           -2.5148090 3.4511428  1.243948888 -1.4637622
## Niger                              -1.2306133 2.2707260  3.834430193 -1.7127184
##                                         Edu2.M
## Swaziland                          -1.23222954
## Tanzania (United Republic of)      -1.81851973
## Cameroon                           -0.91598816
## Zimbabwe                            0.04694906
## Mauritania                         -1.41344651
## Papua New Guinea                   -1.64085604
## Yemen                              -1.20735662
## Lesotho                            -1.48095871
## Togo                               -0.72411137
## Haiti                              -0.90532834
## Rwanda                             -1.84339265
## Uganda                             -0.96573400
## Benin                              -1.19669680
## Sudan                              -1.50938490
## Senegal                            -1.60887657
## Afghanistan                        -1.09720513
## Côte d'Ivoire                      -1.08654531
## Malawi                             -1.38857359
## Ethiopia                           -1.50938490
## Gambia                             -1.03679948
## Congo (Democratic Republic of the) -1.00482001
## Liberia                            -0.75964411
## Mali                               -1.61953640
## Mozambique                         -1.93577777
## Sierra Leone                       -1.38502032
## Burkina Faso                       -2.04237599
## Burundi                            -1.86115902
## Chad                               -1.80430664
## Central African Republic           -1.20735662
## Niger                              -1.87892539
pca_human3 <- prcomp(scaled_human2)
biplot(pca_human3, choices = 1:2)
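The effect of standardizing can also be checked on synthetic data. A minimal sketch (toy data, not the course dataset) showing why scaling matters for PCA: without it, the variable with the largest variance dominates the first principal component.

```r
# Toy data: one small-variance and one large-variance variable
set.seed(1)
toy <- data.frame(
  small = rnorm(100, sd = 1),     # small-variance variable
  large = rnorm(100, sd = 1000)   # large-variance variable
)

pca_raw    <- prcomp(toy)         # unscaled: PC1 is almost entirely 'large'
pca_scaled <- prcomp(scale(toy))  # scaled: both variables can contribute

# scale() gives every column mean 0 and standard deviation 1
colMeans(scale(toy))                   # ~ 0 for both columns
apply(scale(toy), 2, sd)               # exactly 1 for both columns
abs(pca_raw$rotation["large", "PC1"])  # close to 1: 'large' dominates PC1
```

The same effect explains the difference between `pca_human` and `pca_human2` above: GNI has a far larger numeric range than the other human variables, so it dominates the unstandardized PCA.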

Question 4

Give your personal interpretations of the first two principal component dimensions based on the biplot drawn after PCA on the standardized human data. (0-2 points)

# Question: How can we analyze the biplot when we cannot read it well?
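When the biplot labels overlap too much, the same information can be read numerically from the `prcomp` object itself. A sketch on a built-in dataset as a stand-in (substitute `pca_human2` for the standardized human data):

```r
# PCA on a standardized built-in dataset, as a stand-in for pca_human2
pca <- prcomp(scale(iris[, 1:4]))

summary(pca)                   # proportion of variance per component
round(pca$rotation[, 1:2], 2)  # loadings: which variables drive PC1 and PC2
head(round(pca$x[, 1:2], 2))   # the observations' scores on PC1 and PC2
```

Large loadings (positive or negative) on PC1 tell which variables that axis of the biplot mainly represents, even when the arrow labels are unreadable.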

Question 5

The tea data comes from the FactoMineR package and was measured with a questionnaire on tea: 300 individuals were asked how they drink tea (18 questions) and what their perception of the product is (12 questions). In addition, some personal details were asked (4 questions).

Load the tea dataset and convert its character variables to factors:

tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)

Explore the data briefly: look at the structure and the dimensions of the data. Use View(tea) to browse its contents, and visualize the data.

Use Multiple Correspondence Analysis (MCA) on the tea data (or on just certain columns of the data, it is up to you!). Interpret the results of the MCA and draw at least the variable biplot of the analysis. You can also explore other plotting options for MCA. Comment on the output of the plots. (0-4 points)

tea_new <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)
view(tea_new)
str(tea_new)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
##  $ frequency       : Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
# Variables are factors, except a few (age is an integer). Let's map the different values with MCA.

# But first, let's keep only categorical variables: drop the numeric age (its categorical version age_Q is already in the data) and also breakfast, to reduce the number of variables.
tea_new_2 <- tea_new %>% dplyr::select(-breakfast,-age)

library(ggplot2)
library(tidyr)   # pivot_longer() comes from tidyr
pivot_longer(tea_new_2, cols = everything()) %>% 
  ggplot(aes(value)) + facet_wrap("name", scales = "free") + 
  geom_bar() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))

# Too many variables; it is hard to see.
# Question: how to select the variables we want to use? Should they be less correlated?
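One hedged answer to the question above: the association between two categorical variables can be measured with a chi-squared test on their cross-table. A sketch on synthetic factors (not the tea data; substitute columns of `tea_new_2`):

```r
# Two synthetic factors; f2 is constructed to depend on f1, so the
# chi-squared test should detect an association between them
set.seed(42)
f1 <- factor(sample(c("tea bag", "unpackaged"), 200, replace = TRUE))
f2 <- factor(ifelse(f1 == "tea bag",
                    sample(c("chain store", "tea shop"), 200, TRUE, prob = c(0.8, 0.2)),
                    sample(c("chain store", "tea shop"), 200, TRUE, prob = c(0.2, 0.8))))

chisq.test(table(f1, f2))   # a small p-value indicates a strong association
```

Strongly associated variables carry overlapping information, so one option is to keep only one of each strongly associated pair before running the MCA.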


library(FactoMineR)
mca <- MCA(tea_new_2, graph = FALSE)

# summary of the model
summary(mca)
## 
## Call:
## MCA(X = tea_new_2, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.092   0.084   0.072   0.060   0.057   0.055   0.050
## % of var.              5.913   5.392   4.629   3.881   3.672   3.502   3.188
## Cumulative % of var.   5.913  11.305  15.934  19.815  23.488  26.989  30.177
##                        Dim.8   Dim.9  Dim.10  Dim.11  Dim.12  Dim.13  Dim.14
## Variance               0.049   0.048   0.045   0.042   0.042   0.039   0.038
## % of var.              3.138   3.050   2.885   2.675   2.672   2.491   2.424
## Cumulative % of var.  33.315  36.365  39.251  41.926  44.598  47.089  49.513
##                       Dim.15  Dim.16  Dim.17  Dim.18  Dim.19  Dim.20  Dim.21
## Variance               0.036   0.036   0.034   0.032   0.032   0.031   0.030
## % of var.              2.320   2.305   2.173   2.077   2.049   2.016   1.944
## Cumulative % of var.  51.833  54.138  56.311  58.387  60.436  62.453  64.397
##                       Dim.22  Dim.23  Dim.24  Dim.25  Dim.26  Dim.27  Dim.28
## Variance               0.029   0.027   0.026   0.026   0.025   0.024   0.024
## % of var.              1.880   1.761   1.692   1.669   1.624   1.553   1.544
## Cumulative % of var.  66.276  68.038  69.730  71.399  73.023  74.577  76.120
##                       Dim.29  Dim.30  Dim.31  Dim.32  Dim.33  Dim.34  Dim.35
## Variance               0.023   0.022   0.022   0.021   0.020   0.019   0.019
## % of var.              1.456   1.426   1.404   1.339   1.264   1.246   1.221
## Cumulative % of var.  77.576  79.002  80.406  81.745  83.009  84.255  85.476
##                       Dim.36  Dim.37  Dim.38  Dim.39  Dim.40  Dim.41  Dim.42
## Variance               0.017   0.017   0.017   0.016   0.016   0.015   0.015
## % of var.              1.120   1.112   1.078   1.028   1.005   0.968   0.939
## Cumulative % of var.  86.596  87.708  88.786  89.814  90.819  91.786  92.725
##                       Dim.43  Dim.44  Dim.45  Dim.46  Dim.47  Dim.48  Dim.49
## Variance               0.013   0.013   0.012   0.012   0.011   0.010   0.010
## % of var.              0.866   0.837   0.802   0.739   0.697   0.674   0.666
## Cumulative % of var.  93.590  94.427  95.229  95.968  96.666  97.339  98.005
##                       Dim.50  Dim.51  Dim.52  Dim.53
## Variance               0.009   0.008   0.007   0.006
## % of var.              0.603   0.531   0.458   0.403
## Cumulative % of var.  98.608  99.139  99.597 100.000
## 
## Individuals (the 10 first)
##                 Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr
## 1            | -0.629  1.432  0.202 |  0.148  0.086  0.011 |  0.109  0.055
## 2            | -0.427  0.660  0.139 |  0.288  0.330  0.064 | -0.110  0.056
## 3            |  0.097  0.034  0.006 | -0.156  0.096  0.015 |  0.125  0.073
## 4            | -0.560  1.133  0.227 | -0.279  0.308  0.056 | -0.026  0.003
## 5            | -0.173  0.108  0.028 | -0.149  0.088  0.020 |  0.027  0.003
## 6            | -0.681  1.675  0.272 | -0.293  0.340  0.050 | -0.008  0.000
## 7            | -0.223  0.181  0.037 |  0.015  0.001  0.000 |  0.183  0.155
## 8            | -0.031  0.003  0.001 |  0.111  0.048  0.009 | -0.097  0.043
## 9            | -0.063  0.014  0.003 |  0.266  0.281  0.048 |  0.388  0.697
## 10           |  0.177  0.114  0.021 |  0.369  0.539  0.090 |  0.315  0.459
##                cos2  
## 1             0.006 |
## 2             0.009 |
## 3             0.009 |
## 4             0.000 |
## 5             0.001 |
## 6             0.000 |
## 7             0.025 |
## 8             0.007 |
## 9             0.103 |
## 10            0.066 |
## 
## Categories (the 10 first)
##                 Dim.1    ctr   cos2 v.test    Dim.2    ctr   cos2 v.test  
## Not.tea time | -0.552  4.238  0.236 -8.396 |  0.001  0.000  0.000  0.019 |
## tea time     |  0.428  3.285  0.236  8.396 | -0.001  0.000  0.000 -0.019 |
## evening      |  0.296  0.961  0.046  3.704 | -0.404  1.961  0.085 -5.051 |
## Not.evening  | -0.155  0.503  0.046 -3.704 |  0.211  1.025  0.085  5.051 |
## lunch        |  0.590  1.630  0.060  4.231 | -0.406  0.844  0.028 -2.907 |
## Not.lunch    | -0.101  0.280  0.060 -4.231 |  0.070  0.145  0.028  2.907 |
## dinner       | -1.047  2.447  0.082 -4.965 | -0.080  0.016  0.000 -0.379 |
## Not.dinner   |  0.079  0.184  0.082  4.965 |  0.006  0.001  0.000  0.379 |
## always       |  0.342  1.281  0.061  4.276 | -0.255  0.784  0.034 -3.193 |
## Not.always   | -0.179  0.670  0.061 -4.276 |  0.134  0.410  0.034  3.193 |
##               Dim.3    ctr   cos2 v.test  
## Not.tea time  0.063  0.071  0.003  0.959 |
## tea time     -0.049  0.055  0.003 -0.959 |
## evening       0.324  1.469  0.055  4.052 |
## Not.evening  -0.169  0.768  0.055 -4.052 |
## lunch         0.254  0.386  0.011  1.822 |
## Not.lunch    -0.044  0.066  0.011 -1.822 |
## dinner        0.744  1.581  0.042  3.531 |
## Not.dinner   -0.056  0.119  0.042 -3.531 |
## always        0.095  0.126  0.005  1.186 |
## Not.always   -0.050  0.066  0.005 -1.186 |
## 
## Categorical variables (eta2)
##                Dim.1 Dim.2 Dim.3  
## tea.time     | 0.236 0.000 0.003 |
## evening      | 0.046 0.085 0.055 |
## lunch        | 0.060 0.028 0.011 |
## dinner       | 0.082 0.000 0.042 |
## always       | 0.061 0.034 0.005 |
## home         | 0.013 0.002 0.027 |
## work         | 0.071 0.020 0.027 |
## tearoom      | 0.326 0.021 0.027 |
## friends      | 0.201 0.058 0.024 |
## resto        | 0.155 0.007 0.030 |
# Visualize the MCA: many categories from many variables, all at once, to see which ones go together as the same behavior, e.g. "unpackaged" and "tea shop" should belong to the same type of user.
plot(mca, invisible=c("ind"), graph.type = "classic")

# Let's now choose only a few variables. TODO: SELECT OTHER VARIABLES. Question: how to analyze the correlation between only categorical variables, to then select the ones I want to study with the MCA?
tea_new_2 %>%
  pivot_longer(cols = everything()) %>%
  ggplot(aes(x = value)) +
  facet_wrap(~ name, scales = "free") +
  geom_bar() +
  labs(title = "Distribution of Categorical Variables")

# Select a smaller dataset to do the MCA analysis
tea_smaller <- tea_new %>% dplyr::select(exciting,relaxing,effect.on.health,sex,where,sugar)

# model
mca2 <- MCA(tea_smaller, graph = FALSE)

# summary of the model
summary(mca2)
## 
## Call:
## MCA(X = tea_smaller, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.226   0.211   0.183   0.160   0.156   0.123   0.108
## % of var.             19.413  18.045  15.680  13.683  13.381  10.541   9.257
## Cumulative % of var.  19.413  37.458  53.138  66.822  80.202  90.743 100.000
## 
## Individuals (the 10 first)
##                         Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                    |  0.385  0.219  0.158 | -0.267  0.113  0.076 | -0.087
## 2                    |  0.091  0.012  0.009 |  0.587  0.546  0.363 | -0.549
## 3                    | -0.667  0.655  0.722 | -0.123  0.024  0.024 | -0.007
## 4                    |  0.093  0.013  0.011 | -0.803  1.021  0.840 |  0.058
## 5                    | -0.223  0.073  0.067 | -0.384  0.234  0.198 |  0.212
## 6                    | -0.223  0.073  0.067 | -0.384  0.234  0.198 |  0.212
## 7                    | -0.223  0.073  0.067 | -0.384  0.234  0.198 |  0.212
## 8                    | -0.667  0.655  0.722 | -0.123  0.024  0.024 | -0.007
## 9                    | -0.362  0.193  0.117 | -0.051  0.004  0.002 |  0.264
## 10                   | -0.362  0.193  0.117 | -0.051  0.004  0.002 |  0.264
##                         ctr   cos2  
## 1                     0.014  0.008 |
## 2                     0.550  0.317 |
## 3                     0.000  0.000 |
## 4                     0.006  0.004 |
## 5                     0.082  0.060 |
## 6                     0.082  0.060 |
## 7                     0.082  0.060 |
## 8                     0.000  0.000 |
## 9                     0.127  0.062 |
## 10                    0.127  0.062 |
## 
## Categories (the 10 first)
##                          Dim.1     ctr    cos2  v.test     Dim.2     ctr
## exciting             |   0.816  18.946   0.420  11.203 |   0.294   2.655
## No.exciting          |  -0.514  11.944   0.420 -11.203 |  -0.186   1.674
## No.relaxing          |   0.521   7.521   0.164   7.002 |   0.919  25.182
## relaxing             |  -0.315   4.545   0.164  -7.002 |  -0.555  15.217
## effect on health     |   0.518   4.342   0.076   4.756 |   0.504   4.430
## No.effect on health  |  -0.146   1.225   0.076  -4.756 |  -0.142   1.249
## F                    |  -0.516  11.626   0.389 -10.778 |   0.293   4.026
## M                    |   0.753  16.963   0.389  10.778 |  -0.427   5.875
## chain store          |   0.022   0.022   0.001   0.499 |  -0.304   4.693
## chain store+tea shop |  -0.377   2.717   0.050  -3.862 |   0.613   7.731
##                         cos2  v.test     Dim.3     ctr    cos2  v.test  
## exciting               0.055   4.043 |  -0.627  13.851   0.248  -8.609 |
## No.exciting            0.055  -4.043 |   0.395   8.732   0.248   8.609 |
## No.relaxing            0.510  12.352 |  -0.231   1.831   0.032  -3.105 |
## relaxing               0.510 -12.352 |   0.140   1.106   0.032   3.105 |
## effect on health       0.072   4.631 |   0.866  15.046   0.212   7.956 |
## No.effect on health    0.072  -4.631 |  -0.244   4.244   0.212  -7.956 |
## F                      0.125   6.115 |  -0.228   2.821   0.076  -4.772 |
## M                      0.125  -6.115 |   0.333   4.116   0.076   4.772 |
## chain store            0.165  -7.017 |  -0.271   4.276   0.130  -6.243 |
## chain store+tea shop   0.132   6.282 |  -0.139   0.456   0.007  -1.422 |
## 
## Categorical variables (eta2)
##                        Dim.1 Dim.2 Dim.3  
## exciting             | 0.420 0.055 0.248 |
## relaxing             | 0.164 0.510 0.032 |
## effect.on.health     | 0.076 0.072 0.212 |
## sex                  | 0.389 0.125 0.076 |
## where                | 0.108 0.170 0.490 |
## sugar                | 0.203 0.332 0.039 |
# visualize MCA 
plot(mca2, invisible=c("ind"), graph.type = "classic")
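A small base-R sketch using the eigenvalue percentages printed by summary(mca2) above: cumulative sums show how much of the variation the first MCA dimensions capture together (the same numbers are also available programmatically as mca2$eig).

```r
# "% of var." row from summary(mca2), copied by hand
pct_var <- c(19.413, 18.045, 15.680, 13.683, 13.381, 10.541, 9.257)

cumsum(pct_var)   # Dim.1 and Dim.2 together cover about 37% of the variation
```

With only six variables, the first two dimensions already summarize a substantial share of the variation, which is why the variable biplot of mca2 is much easier to interpret than the one for the full dataset.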
